Algorithmic Inference in Machine Learning 


Book - January 2006 




3 authors: 
B. Apolloni 
University of Milan 


Sabrina Gaito 


University of Milan 




Dario Malchiodi 
University of Milan 


It is unquestionably the place of the universe, 
to which the motion and position 

of all the other heavenly bodies 

are compared... 


[N. Copernicus, De Revolutionibus Orbium Coelestium, Book 1, Chapter X] 


Preface to the second edition


After the first edition of our volume sold out, we decided to prepare a second
one with some major modifications and improvements. They consisted
in: i) removing typos and mistakes in general (mandatory), ii) enhancing the
distinguishing features of the theory (necessary), and iii) completing the
contents so as to have a true textbook for university-level courses (appropriate).
In carrying out the first task we aimed to fall below the rate of 0.04 mistakes
per kilobyte. This is a typical rate in scientific text processing such as genome
sequencing; thus using it as a quality indicator represents a first operational
use of the theory, and a lofty excuse for any residual errors as well. As for the
second task, in view of the reactions to numerous research papers published in
the meantime, we are not sure that we will succeed in convincing most academics
that our approach is not a sloppy variant of the Bayesian one but instead
presents a totally different perspective. We did try our best, however,
outlining specific conceptual differences. In essence, these differences lie in
having as the primary goal the discovery of structures underlying data, getting
probabilities as a corollary. To fulfill our third objective, we devoted particular
attention to links with the classical approach based on the sample distribution
laws. We highlighted where the latter is more convenient than ours and, vice
versa, where ours provides more useful tools. We added examples and exercises
to help the reader grasp these features in full. In this way, with chapters 1 and
2 and appendices A and B we dealt with nearly all the topics covered in a basic
probability and mathematical statistics course. The remaining chapters and
appendices comprise methodological parts of courses in the sphere of Intelligent
Systems. In conclusion, we hope that this new edition improves the way we
communicate a new idea of dealing with random phenomena, one that fits
the currently available computational power. It results in more appealing
statistical tools to be taught at various levels in both basic and advanced
university courses. Researchers should be the first to use these tools, to set
up learning paradigms driven by the data structure as a modern evolution of
statistical inference algorithms.


Milano, April 2006 Bruno Apolloni, Dario Malchiodi, Sabrina Gaito. 




Preface 


Taking a look at the list of incoming messages in the new mailbox you activated
fifteen days ago, and wanting to grasp how much you will be disturbed in the
future by unsolicited e-mail, you may quantify the annoyance as the number of
messages of this kind you will receive over the next year. You naturally relate this
mailbox property to the number of such messages you received during the first
work period. Now, as you don’t know what will happen in the future, you must
attach a likelihood measure to any assertion you make about the next e-mails.
To this end you rely on the sole assumption that the mail server profile will
remain the same. This is an effective example of the framework in which we develop
the theory and operational tools described in this book. Instantiating a property
on new observations of a given phenomenon will be a mere matter of relating
this property to our previous experience of the phenomenon, i.e. to the value
assumed by a suitable function of the past observed data. The likelihood we get for
the property is a measure of our capacity to describe these data in terms of the
knowledge we have of the phenomenon at hand (e.g. the habits of spammers).
In asymptotic scenarios this measure coincides with the frequency with which
the property is verified, thus letting us exploit most results of canonical
probability theory.

If you consider in this scenario a never-ending sequence of future years of
your mailbox life (even beyond your own lifetime or any technological system
update), you may realize that it holds all possible yearly histories, and that
what you saw in the first fifteen days, though concerning the past, is still
reproduced in a piece of the sequence. You don’t know exactly which piece and
therefore regard it as a sample drawn at random from the sequence. Meanwhile you
assume the frequency of unsolicited mails in the infinite history to be a good
representative of the analogous frequency in the next year you are interested
in, and try to approximate the former (now a fixed parameter) with a quantity
computed on the sample. We will follow the nice metaphor of a God generating
random data by tossing dice when we want to build theoretical probabilistic
models and, of course, identify likelihood with probability. But we return to our
original operational framework when we want to state properties of our future
observations.

A good compromise for preserving the rationale of this framework
while still exploiting most results of the canonical one is based on a special model
of random variables: they are produced by suitable algorithms taking as input
a standard source of random noise. Thus the object of our inference is precisely
one of these algorithms, about which we either know nothing or simply do not
know some of its parameters. We require that it underlies both the sequence of
values we have already observed and those we will meet in the future. Hence, we
call this theory algorithmic inference. Exploiting this approach to inference
requires solving inverse problems. What is required of the user
is an aptitude for stating rigorous logical relations between abstract properties
and observed data, conceding nothing to comfortable working hypotheses, and
facility in massive and sophisticated data processing. The latter task is often
feasible today, thanks to the fast computers available, but it prevented the
development of this approach in the past.

Philosophical propensities apart, the framework we propose lets us override 
and improve the usual results of the statistical inference theory based on the 
above metaphor. It also deals with the new inference paradigms undertaken 
by the computational learning theory in a very easy and understandable way. 
Moreover, it lets us solve complex regression problems and offer operational 
hints to wide families of subsymbolic learning algorithms implemented on neural 
networks. 

The volume is self-contained, which makes it suitable also for readers not 
yet acquainted with classical mathematical statistics or learning theories. The 
philosophical and methodological approaches are definitely new, but the focus of 
the book is on their operational implications. We check them in many numerical 
examples drawn from usual inference instances. We split our work into two main
parts. First we present the core of our theory and solutions to the most common 
estimation problems. The second part focuses on the bases for learning either 
Boolean or continuous functions, neural networks included. Four appendices 
have been written as well to enable the reader to solve (within our approach)
the main inference and computational problems he/she faces in a regular course 
of mathematical statistics. They also supply him/her with the usual technical 
background in this field. 


Milano, September 2003 Bruno Apolloni, Dario Malchiodi, Sabrina Gaito. 


Acknowledgments 


The book encompasses a series of ideas and experiments developed throughout 
the life of the Laboratory for Neural Networks at the University of Milano. 
Thus the authors thank all researchers and students who spent their talent 
in the lab activities. We mention in particular colleagues Diego de Falco and 
Anna Maria Zanaboni who shared with the authors many years of teaching and 
research in the fields of Probability and Cybernetics, not necessarily adopting 
the same theoretical framework. We also mention the students Simone Bassis, 
Andrea Brega and Michele Scaramal who contributed directly to discussion 
of some theoretical issues and implementation of some numerical experiments. 
We address special thanks to Prof. Vincenzo Capasso of the Mathematics
Department of our University, and to Prof. Ettore Marubini of the Istituto 
Nazionale per lo Studio e la Cura dei Tumori of Milan, and his cooperators Elia 
Biganzoli and Patrizia Boracchi, the first to welcome our different perspective 
on statistics. Lastly and primarily, we thank Prof. Lakhmi Jain of the KES 
Centre, University of South Australia who as editor of this collection of books 
encouraged us to carry out the work. 


Milano, September 2003 Bruno Apolloni, Dario Malchiodi, Sabrina Gaito. 




Contents


I Foundations

1 Knowledge versus randomness
  1.1 Fair properties that hold for sets of uncertain data
    1.1.1 Event algebra
  1.2 Probability measures
    1.2.1 Combinatorial measure of probability
    1.2.2 Kolmogorov axioms of probability
    1.2.3 Exploiting knowledge to build probability models
  1.3 Random variables
    1.3.1 Standard ways for describing random variables
    1.3.2 Fixed properties in a sequence of variable specifications
    1.3.3 From single to many bits valued variables
    1.3.4 Beyond the Bernoulli experiment
    1.3.5 Aggregates of random variables
    1.3.6 Functions of random variables and their generators
    1.3.7 Limit theorems
  1.4 Bibliographical notes and further readings

2 Algorithmic inference
  2.1 The predictive approach: a string of bits partitioned by a cursor
    2.1.1 The inference approach
    2.1.2 A universal sample generator
    2.1.3 In search of compatible populations
    2.1.4 Bootstrapping populations
    2.1.5 Deriving analytical expressions
    2.1.6 Pivots and sufficient statistics
  2.2 Confidence intervals
    2.2.1 Solving inverse problems to find the confidence interval extremes
    2.2.2 Checking the coverage
    2.2.3 Devising confidence regions
  2.3 Point estimators
    2.3.1 Square error
    2.3.2 Shortening the procedures
    2.3.3 Adequacy of the sample size
  2.4 Bibliographical notes and further readings

II Machine learning applications

3 Computational learning
  3.1 Learning Boolean functions
    3.1.1 Error measure distribution law
    3.1.2 Learning rectangles
    3.1.3 The PAC learning goal
    3.1.4 Sentry points
    3.1.5 A twisting argument for learning
  3.2 Computing the hypotheses
    3.2.1 Identifying consistent hypotheses
    3.2.2 Relaxing the consistency constraints. The structural risk minimization
  3.3 Further biases
    3.3.1 Distribution dependent complexity indices
    3.3.2 Sentry points vs. support vectors
    3.3.3 On-line learning
  3.4 Controlling the error
    3.4.1 Confidence intervals for the learning error
    3.4.2 Sample complexity
    3.4.3 Point estimators for the learning error
  3.5 Splitting the sentineling functionality: learning by pair of concepts
  3.6 Learnability
  3.7 Bibliographical notes and further readings

4 Regression theory
  4.1 The reference problem
  4.2 Linear regression
    4.2.1 Learning a confidence region for a regression line
    4.2.2 Curve distribution
    4.2.3 Pivot
    4.2.4 Nested regions
    4.2.5 Confidence region for the dependent variable
    4.2.6 Implementing the procedure
    4.2.7 The multidimensional case
  4.3 Moving to non linear functions
    4.3.1 Confidence intervals for the hazard function of survival data
    4.3.2 D1 representation
    4.3.3 D2 representation
    4.3.4 Checking the method
    4.3.5 A non linear mixed effects regression model
  4.4 Adequacy test of a model
  4.5 Point estimators
  4.6 Estimation efficiency
  4.7 Bibliographical notes and further readings

5 Subsymbolic learning
  5.1 A very essential taxonomy of neural networks
  5.2 Compression of a set of data
  5.3 In search of a sufficient statistic
  5.4 Learning strategies
    5.4.1 The perceptron theorem
    5.4.2 Minimization methods
    5.4.3 Dimensioning training and test sets
    5.4.4 Piecewise building of the pivotal statistics
    5.4.5 Hybrid learning systems
  5.5 Learning from no teacher
    5.5.1 Self-associative memories
    5.5.2 Self-organizing memories
  5.6 Bibliographical notes and further readings

III Appendices

A Combinatorial calculus
  A.1 Managing configurations
  A.2 Taking samples probabilities

B Random variables
  B.1 Distribution laws
  B.2 Computing a sample of i.i.d. random variables
    B.2.1 Samples from a uniform continuous random variable in [0,1]
    B.2.2 Computing c.d.f. from data
    B.2.3 Generating discrete random variables
    B.2.4 Generating continuous random variables
  B.3 Basic random variables
    B.3.1 Discrete random variables
    B.3.2 Continuous random variables
  B.4 Joint random variables
    B.4.1 The discrete case
    B.4.2 The continuous case
  B.5 Functions of random variables
    B.5.1 From $F_{(X_1,\ldots,X_n)}$ to $F_{g(X_1,\ldots,X_n)}$
    B.5.2 From $f_X$ to $f_{g(X)}$
    B.5.3 Using the moment generating function
    B.5.4 Broad identification of transformed variables
  B.6 Exercises

C Markov Chains

D Computational complexity

Bibliography

Index


Part I


Foundations 


1 Knowledge versus randomness


Continuing with the example introduced in the preface, imagine that you
installed a new mailbox fifteen days ago and you are now trying to guess how large
a portion of the incoming e-mail will be unsolicited. Perhaps in one message
someone proposes to sell you a house, in another a company offers its financial
services, etc. You open the inbox and find, for instance, 20 messages sent to you
so far. You don’t know which ones are useful and which ones are not. Thus you
open the messages in any order, starting from the latest, the shortest, whatever.
In the end you will have a record of 20 bits having 1 in the i-th position if the
i-th e-mail you opened proved unsolicited, 0 otherwise. Then you wonder how
annoying it will be to check your mailbox in the future. Or rather, to be more
precise, you want to get an idea of how many unsolicited messages you will receive
in the subsequent year.


1.1 Fair properties that hold for sets of uncertain data


Since you are investigating something that at present does not exist, and are 
not relying on any magician’s crystal ball, you should decide to establish a 
reasonable protocol with which to deal with this or similar uncertain situations. 
The protocol we propose consists of the following three rules. 


• CONSISTENCY RULE. Any function of the observed data you will consider
to argue properties about the future must depend on all observed data.
Thus once you discover an uninteresting promotional message in your
mailbox, you may be disappointed, but you must register an additional
unit on the number of junk messages therein observed.


• UNIFORMITY RULE. The shape of any function of the observed data you
will consider to argue properties about the future must not depend on the
number of the observed data. Accordingly, you cannot for instance decide
to multiply by three the number of observed 1’s if the total number of
incoming messages is over seven.


• SYMMETRY RULE. The value of any function of the observed data you will
consider to argue properties about the future must not change with any
permutation of data that does not change their meaning. This means that
you must not get anxious about looking for the most suitable order in
which to browse the inbox if you have no actual reason for preferring one
order over another. In this case the assertions you may make about the
whole set of 20 e-mails must be independent of the order. Things are different if
for instance you are interested in the distribution of annoying mail during
daytime hours: now abandoning the temporal order would be misleading.
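
The three rules can be checked mechanically on a candidate function of the data. The following is only an illustrative sketch: the function names (`junk_count`, `first_bit`, `inflated_count`, `is_symmetric`) and the five-bit record are assumptions introduced here, not part of the protocol itself.

```python
from itertools import permutations

def junk_count(bits):
    """Number of unsolicited messages: uses every bit, same form for any length."""
    return sum(bits)

def first_bit(bits):
    """Breaks consistency and symmetry: it looks at a single observation."""
    return bits[0]

def inflated_count(bits):
    """Breaks uniformity: the form of the function changes with the sample size."""
    return 3 * sum(bits) if len(bits) > 7 else sum(bits)

def is_symmetric(statistic, bits):
    """True if the statistic takes one and the same value on every reordering."""
    return len({statistic(list(p)) for p in permutations(bits)}) == 1

record = [1, 0, 1, 0, 1]  # hypothetical record of five opened messages
print(is_symmetric(junk_count, record))  # the count ignores the browsing order
print(is_symmetric(first_bit, record))   # the first bit depends on it
```

Counting the 1's is the kind of function the protocol admits; the other two are there only to show what each rule excludes.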


The last rule is particularly delicate, since it involves the knowledge we
have about a phenomenon. To be sharp, and in accordance with the
Church-Turing thesis ¹, in this book we will identify knowledge with a code operationally
describing it in our own computer, and properties with all that can be computed
by the code. Thus knowledge becomes strictly a matter of who observes the
data. Conversely, unawareness arises from the observer’s inability to distinguish
in a useful way the data he is processing. We denote by computational context his
computing facilities, including hardware, software and the set of possible outputs
of interest to him. Within this context the above inability coincides with the
absence of a tool computing different outputs on different subsets of the data.
More precisely:


Definition 1.1. Given a sequence of observed data $s$ (a string in the computer
science terminology) we call computational context $\mathcal{C}$ the support where the
data are reported, the program library available to process them and the set $\Theta$
of possible program outputs as well. We denote symmetry ensemble $\Pi_{\mathcal{C}}(s)$ w.r.t.
$\mathcal{C}$ the set of all possible listing sequences $\sigma$ of the data written in $s$ such
that for no pair $\sigma', \sigma'' \in \Pi_{\mathcal{C}}(s)$ a computation $c \in \mathcal{C}$ exists such that
$c(\sigma') = \theta$ and $c(\sigma'') \neq \theta$ for a $\theta \in \Theta$.


Remark 1.1. A different index is associated to each observed data item in s. 
Hence, notwithstanding their contents, two listing sequences are different if the 
index sequences are different. 


Example 1.1. Let the binary encoding of our first 5 observed data be 10101.
This means that of the first five messages specifically the first, third and fifth
proved unsolicited. To enhance the positional notation we will rewrite the string
as $1_1 0_2 1_3 0_4 1_5$. The quantity we are interested in is a forecast $K$ of the number
of next unsolicited messages, and our ignorance of this quantity does not change
if we consider the data in a different order. Restated according to Definition
1.1, we have no computer code at present for discriminating one sequencing
from another in order to output this quantity.
¹ According to the Church-Turing thesis [Kleene, 1967], every computation you can think
of can in principle be performed by a usual computer, except possibly for problems related to
lack of memory or unbearable time requirements.




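Table 1.1: Symmetry ensemble for the sequence $1_1 0_2 1_3 0_4 1_5$, where each bit codes a
single feature. In each entry every bit is followed by its position index, e.g.
0204151113 stands for $0_2 0_4 1_5 1_1 1_3$.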


1102130415 | 0204151113 | 0402131115 | 1102131504
0204151311 | 0402131511 | 1102041315 | 0215111304 
0402151113 | 1102041513 | 0215110413 | 0402151311 
1102151304 | 0215131104 | 0413110215 | 1102150413 
0215130411 | 0413111502 | 1113020415 | 0215041113 
0413021115 | 1113021504 | 0215041311 | 0413021511 
1113040215 | 1311020415 | 0413151102 | 1113041502
1311021504 | 0413150211 | 1113150204 | 1311040215 
0415110213 | 1113150402 | 1311041502 | 0415111302 
1104021315 | 1311150204 | 0415021113 | 1104021513 
1311150402 | 0415021311 | 1104130215 | 1302110415 
0415131102 | 1104131502 | 1302111504 | 0415130211 
1104150213 | 1302041115 | 1511021304 | 1104151302 
1302041511 | 1511020413 | 1115021304 | 1302151104 
1511130204 | 1115020413 | 1302150411 | 1511130402 
1115130204 | 1304110215 | 1511040213 | 1115130402 
1304111502 | 1511041302 | 1115040213 | 1304021115 
1502111304 | 1115041302 | 1304021511 | 1502110413 
0211130415 | 1304151102 | 1502131104 | 0211131504 
1304150211 | 1502130411 | 0211041315 | 1315110204 
1502041113 | 0211041513 | 1315110402 | 1502041311 
0211151304 | 1315021104 | 1513110204 | 0211150413 
1315020411 | 1513110402 | 0213110415 | 1315041102 
1513021104 | 0213111504 | 1315040211 | 1513020411
0213041115 | 0411021315 | 1513041102 | 0213041511 
0411021513 | 1513040211 | 0213151104 | 0411130215 
1504110213 | 0213150411 | 0411131502 | 1504111302 
0204111315 | 0411150213 | 1504021113 | 0204111513 
0411151302 | 1504021311 | 0204131115 | 0402111315 
1504131102 | 0204131511 | 0402111513 | 1504130211


Therefore the symmetry ensemble $\Pi_{\mathcal{C}}(s)$ is constituted by the 120 strings in
Table 1.1.
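
The size of this ensemble can be reproduced with a few lines of code. The sketch below follows one reading of Definition 1.1: we keep every listing of the indexed data that none of the available computations distinguishes from $s$. The helper names (`indexed`, `symmetry_ensemble`, `count_ones`, `first_is_one`) are assumptions made for illustration.

```python
from itertools import permutations

def indexed(bits):
    """Attach a 1-based position to each bit, as in the rewritten string."""
    return [(b, i) for i, b in enumerate(bits, start=1)]

def symmetry_ensemble(s, computations):
    """Listings of s that no available computation tells apart from s itself."""
    reference = [c(s) for c in computations]
    return [p for p in permutations(s)
            if [c(list(p)) for c in computations] == reference]

def count_ones(listing):
    """An order-insensitive computation: number of unsolicited messages."""
    return sum(bit for bit, _ in listing)

def first_is_one(listing):
    """An order-sensitive computation: was the first listed message unsolicited?"""
    return listing[0][0] == 1

s = indexed([1, 0, 1, 0, 1])
ensemble = symmetry_ensemble(s, [count_ones])
print(len(ensemble))                                # all 5! listings survive
print("".join(f"{b}{i}" for b, i in ensemble[0]))   # flattened as in Table 1.1
print(len(symmetry_ensemble(s, [count_ones, first_is_one])))
```

With only the order-insensitive count in the context, all 5! = 120 listings are kept. Adding a computation that looks at the first position makes the ensemble smaller, which is the sense in which the ensemble depends on the computational context.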


Example 1.2. If the e-mail interest score is now expressed by four values from 0
to 3, which we code in the binary alphabet, the observation sequence is still a
binary string where every outcome corresponds to a pair of bits. For instance
if the scores 0, 1 and 2 are observed, the sequence will be $0_1 0_2 0_3 1_4 1_5 0_6$. The
corresponding symmetry ensemble, as illustrated in Table 1.2, is now constituted
only by the possible permutations of the above pairs.




Table 1.2: Symmetry ensemble for the string $0_1 0_2 0_3 1_4 1_5 0_6$, where pairs of consecutive
bits code a single feature.
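
When a feature spans two bits, only whole pairs may be moved. A minimal sketch of this restriction follows; the function names `blocks` and `block_ensemble` are hypothetical, and the printed listings are what the sketch generates, not a transcription of the table.

```python
from itertools import permutations

def blocks(bits, width):
    """Group indexed bits into consecutive blocks of `width` bits (one feature each)."""
    tagged = [(b, i) for i, b in enumerate(bits, start=1)]
    return [tuple(tagged[k:k + width]) for k in range(0, len(tagged), width)]

def block_ensemble(bits, width):
    """All listings obtained by permuting whole blocks, never splitting a block."""
    return ["".join(f"{b}{i}" for block in p for b, i in block)
            for p in permutations(blocks(bits, width))]

ensemble = block_ensemble([0, 0, 0, 1, 1, 0], width=2)  # scores 0, 1, 2
for listing in ensemble:
    print(listing)
```

With three pairs there are 3! = 6 listings, against the 6! = 720 we would get if each of the six bits coded a feature of its own.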


1.1.1 Event algebra 


Now let us settle a first formal tool enabling us to describe the fact that a
message may prove annoying or not. We do this simply by analyzing histories
of these results and identifying some of them with a hypothetical procedure for
their generation that we call experiment. As usual, the framework is that we
have collected n data on a given phenomenon and will keep carrying out new
observations. We may imagine each new data item as an experiment output
that we call elementary event. We don’t know how the story continues, other
than that it will be a suffix of the actually observed data ².


Definition 1.2. For a given observation sequence $s$, we call feature space $\Omega(s)$ the
set constituted by all the distinct elements in $s$ (independently of their order in
the sequence), event each subset of this space and elementary event every event
whose cardinality is 1 ³. An experiment is a family of observation sequences,
where each element in the sequence belongs to an experiment space $\Omega'(s) \supseteq \Omega(s)$
and each sequence in the family has $s$ as prefix. By abuse of notation we identify
an experiment with a procedure generating sequences from the above family in
any order. At each step the experiment outputs an elementary event in $\Omega'(s)$.
We say that an event is verified if any of the elementary events contained in it
is observed.


Example 1.3. If the phenomenon we are observing is the tossing of a die and the
sequence obtained after four observations is $s = (6, 5, 3, 6)$, possibly coded as in
Example 1.2, then $\Omega(s) = \{3, 5, 6\}$. Excluding the possibility that the die does
not fall on one of its faces, that it breaks when hitting the floor, etc., we can assume
$\Omega'(s) = D_6$ = {“the outcome is 1”, “the outcome is 2”, “the outcome is 3”, “the
outcome is 4”, “the outcome is 5”, “the outcome is 6”}, which we recode for
short in the set $\Omega'(s) = \{1, 2, 3, 4, 5, 6\}$. Given the observation sequence $s$, the
experiment consisting in tossing the die seven times is identified with the family
of sequences $s_1, \ldots, s_{216}$, each having the form $(6, 5, 3, 6, o_1, o_2, o_3)$, where $o_i$ can
assume the integer values between 1 and 6 (and in fact the number of sequences
will be $6^3 = 216$). If $s$ is the empty string $\varepsilon$, the experiment identifies with the
family of sequences $s_1, \ldots, s_{279936}$, each having the form $(o_1, o_2, o_3, o_4, o_5, o_6, o_7)$.


² As usual, by prefix and suffix of a string within a sequence we mean the parts respectively
preceding and following the string.
³ In our context, the cardinality #A of a set A is simply the number of its elements.
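
The two counts in the example can be checked by enumerating the families directly. This is a sketch under the example's own assumptions (a six-faced die, seven tosses in total); the generator name `experiment` is ours.

```python
from itertools import product

def experiment(prefix, outcome_space, total_length):
    """Yield every observation sequence of total_length having `prefix` as prefix."""
    free = total_length - len(prefix)
    for suffix in product(sorted(outcome_space), repeat=free):
        yield tuple(prefix) + suffix

D6 = {1, 2, 3, 4, 5, 6}
s = (6, 5, 3, 6)

feature_space = set(s)                # distinct observed values
family = list(experiment(s, D6, 7))   # three free tosses after the prefix
print(feature_space)
print(len(family))                                # 6**3 sequences
print(sum(1 for _ in experiment((), D6, 7)))      # 6**7 sequences from the empty prefix
```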


Extending notations from computer science, we can state that
feature spaces are recursively enumerable sets, yet not recursive ⁴. In fact, if we
study a phenomenon for a given time we can assume as current feature space
the set of different observed values. But no algorithm exists which, taking as
input any set of such observations, can give as output the feature space at a
forthcoming time (i.e. can decide whether a new element will belong in the future
to the feature space). Rather, the feature space derived from $n$ observations
might get bigger after $N$ subsequent observations.

On the other hand, when we are discussing the phenomenon per se, we are
interested in referring to an experiment outcome space $\Omega$ independent of the actual
observations. This experiment, free of biases coming from previous observations,
we call model. That is what we got in the previous example when we considered
$\Omega'(\varepsilon)$. More in general, we can always obtain $\Omega$ from any $\Omega(s)$ just by adding
an anomalous event to the list of elementary events. This event will cover all
that we have not yet seen but will appear in the suffix of our actually seen
sequence. When we build a model we substitute this event with a set of events
that we logically admit as possible yet unseen outcomes of our experiment. For
instance, starting from a feature space $\Omega(s) = \{3, 5, 6\}$ after four die tosses we
have a first outcome space $\Omega = \{3, 5, 6, a\}$, where $a$ codes the anomalous event.
Then we adopt a model where we do not consider the possibility that a die may
break or balance on an edge and we assume that the faces with numbers 1, 2
and 4 can also appear, getting a new $\Omega$ coinciding with $D_6$ as discussed before.
In so doing, we push the possible discrepancies between past and future a step
up. If new observations do not fit in the outcome space, then we say the model
is unsuitable.
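
The passage from feature space to model can be sketched as follows. The symbol `"a"` for the anomalous event comes from the text; the function names and the `"edge"` observation are hypothetical, introduced only to show a model being declared unsuitable.

```python
ANOMALOUS = "a"  # symbol coding everything not yet observed

def feature_space(observations):
    """Distinct values seen so far."""
    return set(observations)

def first_outcome_space(observations):
    """Feature space enlarged with the anomalous event."""
    return feature_space(observations) | {ANOMALOUS}

def is_suitable(model, new_observations):
    """A model is unsuitable as soon as one new observation falls outside it."""
    return all(x in model for x in new_observations)

s = [6, 5, 3, 6]
D6 = {1, 2, 3, 4, 5, 6}   # model replacing the anomalous event with faces 1, 2, 4
print(first_outcome_space(s))
print(is_suitable(D6, [2, 4, 1]))   # unseen faces are covered by the model
print(is_suitable(D6, ["edge"]))    # a die balancing on an edge is not
```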

Of the three ways of framing the observation content, $\Omega(s)$, $\Omega'(s)$ and $\Omega$,
throughout this chapter we will adopt the last one, since we want to explore how
to suitably model a phenomenon. We will return to experiment and feature
spaces, at least in principle, when we come to the main goal of the book, i.e.
designing procedures for inferring properties of the future observations from
features of the effectively observed data. Summing up:


Definition 1.3. Given a phenomenon we denote outcome space $\Omega$ the experiment
space $\Omega'(\varepsilon)$, where $\varepsilon$ is the empty string.


⁴ For a given universal computing machinery (for instance our computer, see Appendix D),
a set $A$ is recursive if there exists a program that, having in input a generic item $a$, outputs 1
if $a \in A$ and 0 otherwise. $A$ is recursively enumerable if there exists a program whose output
is a sequence listing the elements of $A$ in some order, possibly repeating some of them.




Even after these interventions the phenomenon of tossing a die has no unique
space $\Omega$. It depends on what we are interested in observing. If we are simply
watching a bet on the outcome of the die throw, we will probably be generically
interested in the result of every single toss and therefore the outcome space will
be $\Omega = D_6$. If instead we are betting on a non-elementary event, for instance on
the event “the outcome is odd”, our outcome space will be $\Omega$ = {“the outcome
is even”, “the outcome is odd”}. In other words, we say that our optical sensors
for looking at the bet can be equipped each time with a different dictionary in
order to map the received signal to a symbolic description ⁵.
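
The idea of a dictionary can be made concrete as a function applied to the raw faces. The sketch below is only illustrative; `identity`, `parity` and `outcome_space` are names we introduce here.

```python
D6 = {1, 2, 3, 4, 5, 6}

def identity(face):
    """Dictionary of someone interested in every single face."""
    return face

def parity(face):
    """Dictionary of someone betting on odd versus even."""
    return "the outcome is even" if face % 2 == 0 else "the outcome is odd"

def outcome_space(dictionary, faces=D6):
    """Outcome space induced by reading the raw faces through a dictionary."""
    return {dictionary(face) for face in faces}

tosses = [6, 5, 3, 6]
print(outcome_space(identity))
print(outcome_space(parity))
print([parity(t) for t in tosses])   # the same tosses, described symbolically
```

The same physical tosses thus give rise to two different outcome spaces, one with six elements and one with two.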

Therefore we focus now on the way we must formally characterize the events
in order both to easily relate two different outcome spaces referring to the same
phenomenon (as we remarked in the case of tossing dice) and to discuss these spaces
with other people who didn’t see the data at all. Namely we require that the
set $\mathcal{A}$ of events over an outcome space $\Omega$ fulfill the following definition,
qualifying it as an event algebra:


Definition 1.4. For an outcome space $\Omega$, a collection $\mathcal{A}$ of events over $\Omega$ is an
event algebra if


1. $\Omega \in \mathcal{A}$ ($\Omega$ too is an event);


2. $A^c \in \mathcal{A}$ whenever $A \in \mathcal{A}$ (what is not contained in $A$ can be described
through elements of $\mathcal{A}$) ⁶. In particular an observation that does not fit
with the elementary events listed in $\Omega$ belongs to its complement, i.e. the
empty set $\emptyset$ w.r.t. $\Omega$. Since $\emptyset = \Omega^c$, this set is an event of any $\mathcal{A}$ ⁷;


3. $A \cup B \in \mathcal{A}$ whenever $A \in \mathcal{A}$ and $B \in \mathcal{A}$ (however we compound pieces of
$\Omega$ of interest to us, we still get an event).
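
On a finite outcome space the three conditions can be verified by brute force, and the smallest algebra containing a few chosen events can be built by closing them under complement and union. The sketch below does both for a die; the function names are ours, and the three generating events are the ones used in this section's footnote on algebras that miss some elementary events.

```python
from itertools import combinations

def is_event_algebra(omega, events):
    """Check the three conditions of Definition 1.4 on a finite collection."""
    omega = frozenset(omega)
    events = {frozenset(e) for e in events}
    return (omega in events
            and all(omega - a in events for a in events)
            and all(a | b in events for a, b in combinations(events, 2)))

def generated_algebra(omega, generators):
    """Smallest event algebra over omega containing the generators."""
    omega = frozenset(omega)
    events = {omega} | {frozenset(g) for g in generators}
    while True:
        new = {omega - a for a in events}                    # complements
        new |= {a | b for a, b in combinations(events, 2)}   # pairwise unions
        if new <= events:                                    # nothing left to add
            return events
        events |= new

D6 = {1, 2, 3, 4, 5, 6}
algebra = generated_algebra(D6, [{1, 3, 5}, {2, 4, 6}, {2, 3, 6}])
print(len(algebra))
print(frozenset({2}) in algebra)       # the elementary event {2} is missing
print(is_event_algebra(D6, algebra))
```

The closure loop terminates because the collection can only grow inside the finite power set of the outcome space.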


The above properties could define events independently of Definition 1.2.
On the contrary we prefer starting from the latter and considering the above
as suitable properties to deal with the events therein defined. This means that,
to enhance our intuition and meet general thoughts, we expressly require that
elementary events belong to the algebras we are dealing with ⁸. In this way we
can easily relate an event to its subsets in $\Omega$ (i.e. sub-events). In particular we
can split each event $A$ into a subset of it $B \in \mathcal{A}$, a subevent of interest to us,
and the complement of $B$ with respect to $A$, i.e. what we disregard in $A$; then we


⁵Later on we will see how the problem of determining the most suitable dictionary is
equivalent to that of finding an optimal statistic, which amounts in turn to solving our general
inference problem.

⁶We will denote by $A^c$ the complement of $A$ with reference to $\Omega$, i.e. the set $\Omega - A$
constituted by the elements of $\Omega$ not belonging to $A$.

⁷For $\Omega = \Omega(s)$ (the feature space) this property says that the anomalous event is an event
of any $\mathcal{A}$ as well.

⁸Still referring to a die, we might be interested in an algebra whose events are: $A_1 = \{1,3,5\}$,
$A_2 = \{2,4,6\}$ and $A_3 = \{2,3,6\}$ and all those events whose membership is implied by
Definition 1.4, so that the elementary event $e_2 = \{2\}$ does not belong to the algebra.




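Table 1.3: Elementary properties of the events.

Complement                              $A^c = \Omega - A$;  $A \cup A^c = \Omega$;  $A \cap A^c = \emptyset$;  $(A^c)^c = A$
Difference                              $A - B = A \cap B^c$
                                        $A - B = A$ whenever $B = \emptyset$
                                        $A - B = \emptyset$ whenever $A \subseteq B$
Operations with $\Omega$ and $\emptyset$        $A \cup \Omega = \Omega$;  $A \cup \emptyset = A$
                                        $A \cap \Omega = A$;  $A \cap \emptyset = \emptyset$
Commutative laws                        $A \cup B = B \cup A$;  $A \cap B = B \cap A$
Associative laws                        $A \cup (B \cup C) = (A \cup B) \cup C$
                                        $A \cap (B \cap C) = (A \cap B) \cap C$
Distributive laws                       $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
                                        $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
De Morgan laws                          $(A \cap B)^c = A^c \cup B^c$
                                        $(A \cup B)^c = A^c \cap B^c$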

rebuild $A$ as the union of the two parts. Delving further into this partitioning,
we recognize that $A$ occurs ($A$ is true in terms of the first order logic formalism
[Shoenfield, 1967]) if and only if one of its constituting elementary events is
verified. Therefore in the die experiment the event “the outcome is even” is
verified if “the outcome is 2”, “the outcome is 4” or “the outcome is 6” occurs.

Conversely, using these three properties we can relate several events to one
another through the usual set-theoretic operations reported in Table 1.3 (proofs
are left to the reader as an exercise). This allows us to recover the freedom of
considering only the events of interest to us (a main feature of the definition
of events via the sole Definition 1.2) by tailoring outcome spaces focused right
on them (see previous discussion on dice bets).
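The closure required by Definition 1.4 can be explored numerically. The following is a minimal sketch, not part of the original text: the helper name `generate_algebra` is an assumption introduced here for illustration. It starts from a few generating events on a die, repeatedly adds complements and pairwise unions until nothing new appears, and then checks which elementary events end up in the resulting algebra.

```python
from itertools import combinations


def generate_algebra(omega, generators):
    """Close a family of events under complement and finite union (Definition 1.4)."""
    omega = frozenset(omega)
    events = {omega, frozenset()} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        snapshot = list(events)
        # Property 2: the complement of every event is an event.
        for a in snapshot:
            if omega - a not in events:
                events.add(omega - a)
                changed = True
        # Property 3: the union of any two events is an event.
        for a, b in combinations(snapshot, 2):
            if a | b not in events:
                events.add(a | b)
                changed = True
    return events


die = {1, 2, 3, 4, 5, 6}
# Generating events taken from the die example: odd faces, even faces, {2, 3, 6}.
algebra = generate_algebra(die, [{1, 3, 5}, {2, 4, 6}, {2, 3, 6}])

print(len(algebra))                # how many events the generated algebra contains
print(frozenset({2}) in algebra)   # is the elementary event {2} an event here?
print(frozenset({3}) in algebra)   # is {3} an event here?
```

Intersections never need to be added explicitly: by the De Morgan laws of Table 1.3 they follow from complements and unions, which is why the loop only applies properties 2 and 3.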

The reader can easily see that if the outcome space has cardinality $N \in \mathbb{N}$
and we are able to consider each element of this space separately, then the
related event algebra is unique and has cardinality $2^N$, i.e. is the power set of
$\Omega$ ⁹. Actually, any mathematician would point out that, for an outcome space of
infinite cardinality, the fact that $A \cup B \in \mathcal{A}$ whenever $A \in \mathcal{A}$ and $B \in \mathcal{A}$, while
guaranteeing its extendability to the union of an arbitrary finite number of events
in $\mathcal{A}$, cannot ensure that the same property holds for the union of an infinite
number of events. We must take this warning into account when for instance
we want to approximate our observation string with a neverending sequence. If
however property (3) in Definition 1.4 is substituted by

(3) $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$ whenever $A_i \in \mathcal{A}$ for each $i \in \mathbb{N}$ (compounding an infinite
number of events in $\Omega$ still gives an event)

the above requirement is fulfilled. In this case we say that our collection $\mathcal{A}$ is a
sigma-algebra, that we denote by $\Sigma$.


⁹Where the power set of a set $A$, usually denoted by $2^A$, is the set constituted by all subsets
of $A$.




Example 1.4. [Billingsley, 1995] Consider the outcome space $\Omega = [0, 1)$, i.e. the
interval ranging from 0 to 1, including the left extreme but excluding the right
one. The collection $\mathcal{A}$ gathering all the unions of a finite number of intervals
$[a_1, a_2)$ with $a_1, a_2 \in \Omega$ and the empty set is an algebra over the latter set,
as it fulfills the requirements of Definition 1.4. The same collection cannot be
qualified as a sigma-algebra: indeed, for $i \in \mathbb{N}$ the set $A_i = [1/2^i, 1)$ is an
element of $\mathcal{A}$, but $\bigcup_{i=1}^{\infty} A_i = (0,1) \notin \mathcal{A}$.


1.2 Probability measures 


Now you know how to give a structure to your observed data, so that they 
are sharable with people who didn’t observe them. The subsequent problem 
is to exploit this knowledge for guessing something about the future of your 
mailbox. And since you want to discuss it with the above people, your guess 
must be a function of exactly the observed data and any computational context 
you can share with your interlocutor. Moreover you are not discussing the 
metaphysics of a mailbox. Rather, you want to draw from this guess operational 
consequences for your life, obviously as far as your mailbox is concerned. People 
are well prepared to deal with an uncertain future and asymptotic measures 
of its possible realizations. We will consider the probability that your next 
incoming e-mail is noisy in terms of asymptotic frequency of unsolicited messages 
within a long sequence of them that you will receive. We aim to define this 
probability for finite sequences as well, so that the knowledge accumulated from 
past observations is fully exploited. 


1.2.1 Combinatorial measure of probability 


Let us start with the elementary problem of guessing something about a binary 
string, like those coding your interest in a next e-mail. To give a rationale to 
the definitions we will issue in a moment, we propose a family life episode that 
occurred to one of the authors. Last Saturday my son, who plays soccer, told 
me: “Daddy, tomorrow we will win for sure”. As a father I congratulated him. 
As a scientist I wondered how many of the last 20 league games his team had 
won. His answer to my question about it was “None”. Then to ease him through 
the disappointment of an unfavorable fate, I said: “Dear son, I know that the 
next contest will be against a team of standard ability like the previous ones. I 
also know that your team has neither increased nor decreased its skill during this 
league championship. Now, how can you expect the capability to win a contest, 
which never appeared in the previous competitions, to just materialize in the 
next one?”. Of course my son claimed that such an event cannot be declared 
impossible. Then I tried the following reasoning on him. “You say the team’s 
ability is the same as in previous contests. Then all we can get from the fact 
that the next contest will be a victory is that tomorrow, after the game, in 21 




matches you will have had one victory. But the fact that this victory will occur 
precisely tomorrow rather than earlier is a matter of pure chance. Thus your 
team had 21 chances to get the same score (getting the victory precisely in the 
first game of an equivalent 21 games’ history, or precisely in the second, or in 
the third and so on) and only one to achieve victory exactly tomorrow. Since no 
reason appears to favor this chance over the other, we can assume the likelihood 
of the event you forecast for tomorrow to be 1/21, while it was 1/11 after the 
first 10 defeats and will drop to 1/101 if your team keeps losing for a total of 
100 games”. Of course the score of the unlucky soccer player may be quite 
different. It may be that in the first 20 matches he won once or twice, etc. We 
give therefore the following definition: 


Definition 1.5. For a given observation sequence $s$ on a given phenomenon
and any property $A$ concerning a sequence of next observations, let us call
augmented symmetry ensemble $\Pi^{+}_{\mathcal{C}}(s)$ the union of the symmetry ensembles of
all strings satisfying $A$ and having prefix $s$. Then, let us denote by $T(\Pi^{+}_{\mathcal{C}}(s))$
w.r.t. $A$ the set of elements of $\Pi^{+}_{\mathcal{C}}(s)$ satisfying $A$. We call probability $P(A)$ the
relative frequency of these elements, namely

$$P(A) = \frac{\# T(\Pi^{+}_{\mathcal{C}}(s))}{\# \Pi^{+}_{\mathcal{C}}(s)} \tag{1.1}$$


Note that in this definition we work with two algebras related to two different
outcome spaces referring to the same phenomenon. The former concerns the
outcome space $\Omega$ we defined in the previous section, here related possibly to a
sequence of more than a single future outcome; property $A$ (for instance, “next
two bits are 0 and 1”) is an event of this algebra. The latter refers to the
set $\Pi^{+}_{\mathcal{C}}(s)$ representing the union of symmetry ensembles of “virtually observed
sequences”. Here the companion event is a subset of the strings that are of
interest, within our computational context, if property $A$ is verified in $\Omega$.
Let us consider an $s$ constituted by $n$ observations, where $k$ of them can be
synthesized each with the event “a success occurred” and labeled by 1. Instead,
the remaining ones can be referred to as a “failure” and labeled by 0. If no listing
sequence of the $k$ 1’s and $n-k$ 0’s is prevented by our computational context $\mathcal{C}$,
$\Pi_{\mathcal{C}}(s)$ is constituted by all $n!$ permutations of these observations ¹⁰, a condition
that we indicate by saying that $\Pi_{\mathcal{C}}(s)$ is fully expanded and by denoting the
ensemble with $\Pi(s)$ without subscript. Questioning whether the next observation
will be labeled by 1 too is a forecasting problem that we cannot solve with a
crisp “yes” or “no” answer, of course. Rather, let us assume as a hypothesis that
the new label is 1 (property $A$). On the basis of our actual knowledge, it would
imply that we obtain from $s$ an augmented observation sequence $s'$ constituted
by $n + 1$ observations, $k + 1$ of which are labeled by 1. Thus the cardinality


¹⁰See Appendix A for the meaning of $n!$ and the background for the combinatorial
computations in this section.




of $\Pi(s') = \Pi^{+}(s)$ equals $(n + 1)!$. From among these elements the sequences
that satisfy our assumption are exactly those having a 1-labeled observation in
the last position, no matter the order according to which the remaining bits
are listed. This means that we can use $k + 1$ different observations to fill the
last position and for each selection we can fill the remaining positions with any
permutation of the remaining $n$ observations. Thus we count $(k + 1)n!$ different
sequences agreeing with our assumption, and the frequency $\phi$ of these sequences
in the symmetry ensemble reads:

$$\phi = \frac{(k+1)\,n!}{(n+1)!} = \frac{(k+1)\,n!}{(n+1)\,n!} = \frac{k+1}{n+1} \tag{1.2}$$


Therefore, according to Definition 1.5, we can state the following theorem: 


Theorem 1.1. After a set $s$ of $n$ observations, exactly $k$ of which satisfy a
condition $B$, the probability $P(B)$ that the next observation will satisfy $B$, when
no structural property exists between the observations in their computational
context, is given by

$$P(B) = \frac{k+1}{n+1} \tag{1.3}$$

Example 1.5. With the same notation as in Theorem 1.1, if $n = 4$ and $k = 2$,
$P(B) = \frac{3}{5}$ synthesizes the knowledge we draw from the augmented symmetry
ensemble reported in Table 1.4.
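The counting argument behind Theorem 1.1 can be checked by brute force for small strings. The sketch below is only an illustration under the assumption that the symmetry ensemble is fully expanded; the function name `count_augmented_ensemble` is introduced here and does not come from the text. It enumerates every permutation of the augmented sequence and counts those ending with a 1-labeled observation.

```python
from fractions import Fraction
from itertools import permutations


def count_augmented_ensemble(n, k):
    """Return (favourable, total) counts for the hypothesis 'the next label is 1'."""
    # Augmented sequence s': n + 1 distinguishable observations, k + 1 of them labeled 1.
    labels = [1] * (k + 1) + [0] * (n - k)
    observations = list(enumerate(labels))  # (index, label) pairs keep observations distinct
    total = 0
    favourable = 0
    for sequence in permutations(observations):
        total += 1
        if sequence[-1][1] == 1:  # a 1-labeled observation sits in the last position
            favourable += 1
    return favourable, total


favourable, total = count_augmented_ensemble(4, 2)
print(favourable, total, Fraction(favourable, total))
```

For $n = 4$ and $k = 2$ the enumeration visits the $5!$ sequences of Table 1.4, and the resulting ratio can be compared with $(k+1)/(n+1)$ from (1.3). The enumeration grows factorially, so it is only practical for very short strings.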


Analogously, if we assume that the next observation will be labeled by 0,
the frequency $\psi$ of the strings agreeing with this assumption, i.e. having a 0 in
the last position, will be

$$\psi = \frac{(n-k+1)\,n!}{(n+1)!} = \frac{n-k+1}{n+1} \tag{1.4}$$

Thus, meaning by $B^c$ the complement of the event $B$ as before, we obtain:

Corollary 1.1. For $s$ and $B$ as in Theorem 1.1,

$$P(B^c) = \frac{n-k+1}{n+1} \tag{1.5}$$

Equations 1.3 and 1.5 apparently clash with our intuition, since we are used to
assuming that

$$P(B) + P(B^c) = 1 \tag{1.6}$$


But we must consider that our operational perception of P concerns infinitely 
long strings of observations, and for n and k going to infinity the above equal- 
ity holds in our framework as well. At the same time we give a rationale to 
probability also for finite strings. 
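A quick numerical look at (1.3) and (1.5) shows how far their sum is from 1 for short strings and how the gap closes as $n$ grows. This is a small sketch with helper names (`p_next_one`, `p_next_zero`) chosen here for illustration; the ratio $k/n$ is kept at one half purely as an example.

```python
from fractions import Fraction


def p_next_one(n, k):
    """Equation (1.3): probability that the next observation is labeled 1."""
    return Fraction(k + 1, n + 1)


def p_next_zero(n, k):
    """Equation (1.5): probability that the next observation is labeled 0."""
    return Fraction(n - k + 1, n + 1)


# The soccer episode: no victories in 10, 20 and 100 games.
for games in (10, 20, 100):
    print(games, p_next_one(games, 0))

# The sum P(B) + P(B^c) equals (n + 2) / (n + 1) and approaches 1 as n grows.
for n in (4, 40, 400, 4000):
    k = n // 2
    total = p_next_one(n, k) + p_next_zero(n, k)
    print(n, total, float(total))
```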




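Table 1.4: Augmented symmetry ensemble related to the probability of observing
a new 1. Here $n = 4$ and $k = 2$. Each entry lists the five bits, each followed by the
index of its observation (e.g. 1102130415 reads $1_1 0_2 1_3 0_4 1_5$). The 4! sequences
corresponding to the original symmetry ensemble of 4 bits (except for the last dummy
bit equal to $1_5$) are marked with a gray background. The 72 sequences accounting for
the numerator in (1.2), i.e. those ending with a 1-labeled observation, are marked in bold.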


1102130415 
0204151311 
0402151113 
1102151304 
0215130411 
0413021115 
1113040215 
1311021504 
0415110213 
1104021315 
1311150402 
0415131102 
1104150213 
1302041511 
1511130204 
1115130204 
1304111502 
1502111304 
0211130415 
1304150211 
1502041113 
0211151304 
1315020411 
1513021104 
0213041115 
0411021513 
1504110213 
0204111315 
0411151302 
1504131102 


0204151113 
0402131511 
1102041513 
0215131104 
0413111502 
1113021504 
1311020415 
0413150211 
1113150402 
1311150204 
0415021311 
1104131502 
1302041115 
1511020413 
1115020413 
1304110215 
1511041302 
1115041302 
1304151102 
1502130411 
0211041513 
1315021104 
1513110402 
0213111504 
0411021315 
1513040211 
0213150411 
0411150213 
1504021311 
0204131511 


0402131115 
1102041315 
0215110413 
0413110215 
1113020415 
0215041311 
0413151102 
1113150204 
1311041502 
0415021113 
1104130215 
1302111504 
1511021304 
1115021304 
1302150411 
1511040213 
1115040213
1304021511 
1502131104 
0211041315 
1315110402 
1513110204 
0213110415 
1315040211 
1513041102 
0213151104 
0411131502 
1504021113 
0204131115 
0402111513 


1102131504 
0215111304 
0402151311 
1102150413 
0215041113 
0413021511 
1113041502 
1311040215 
0415111302 
1104021513 
1302110415 
0415130211 
1104151302 
1302151104 
1511130402 
1115130402 
1304021115 
1502110413 
0211131504 
1315110204 
1502041311 
0211150413 
1315041102 
1513020411 
0213041511 
0411130215 
1504111302 
0204111513 
0402111315 
1504130211 




Remark 1.2. In case of Example 1.2, quantities such as P(0) have no meaning 
with reference to elementary events, since each outcome is coded into two bits. 
Rather, we will compute P(1, 1) to refer to the probability that the next outcome 
is 3, and so on. However “Next bit = 0” is a compound event that we will learn 
to deal with later. 


Starting from Definition 1.5 we can give probability to more complex events.
Namely, still under the assumptions of Theorem 1.1, the probability $P(0,1)$ that the
first new observation is labeled by 0 and the subsequent one by 1 is given by

$$P(0,1) = \frac{(n-k+1)(k+1)\,n!}{(n+2)!} = \frac{(n-k+1)(k+1)}{(n+2)(n+1)} \tag{1.7}$$

since we have $n-k+1$ zeros to fill the $(n+1)$-th position and $k+1$ ones to fill the
$(n+2)$-th position of the augmented observation sequence; the probability $P(1,0,1)$
of having a sequence of three new observations labeled as in the parentheses is

$$P(1,0,1) = \frac{(k+2)(n-k+1)(k+1)\,n!}{(n+3)!} = \frac{(k+2)(k+1)(n-k+1)}{(n+3)(n+2)(n+1)} \tag{1.8}$$

from analogous computations.
More in general, the probability that in the next $N$ observations we will find
exactly $K$ 1’s in a given order is:

$$\frac{(k+K)(k+K-1)\cdots(k+1)\,(n-k+N-K)(n-k+N-K-1)\cdots(n-k+1)\,n!}{(n+N)!} \tag{1.9}$$


The previous formula refers to a given string with a given sequencing of its items.
On the other hand, the probability we found is independent of the special order
with which the elements appeared. Thus, if we are interested just in the number
$K$ of 1’s we will observe in the next $N$ bits, the strings we must count within
the augmented symmetry ensemble to get the frequency on hand are all those
having exactly $K$ 1’s within the last $N$ elements. These are in number of $\binom{N}{K}$ ¹¹
times the number of strings with the 1’s in a fixed suffix sequence. Thus the
probability $P(K; N, k, n)$ ¹² of finding 1’s in the next $N$ bits for a total of $K$,
as we can state on the basis of the fact that we observed exactly $k$ 1’s in $n$
observations, is given by:


¹¹The binomial coefficient on $N$ and $K$, see Appendix A.
¹²In the following we will use a semicolon in order to distinguish among arguments and
parameters of a function.




$$P(K; N, k, n) = \binom{N}{K}\,\frac{(k+K)(k+K-1)\cdots(k+1)\,(n-k+N-K)(n-k+N-K-1)\cdots(n-k+1)\,n!}{(n+N)!} = \frac{\binom{n}{k}\binom{N}{K}}{\binom{n+N}{k+K}} \tag{1.10}$$

In this way we obtain a formula (a sentence indeed) that depends on the listing
sequence of neither past nor future individual observations. Rather, its value
depends on:

• the total number $n$ of observed data;

• the number $k$ of observed 1’s;

• the total number $N$ of data we plan to observe;

• the number $K$ of 1’s we may observe in the future.
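The closed form (1.10) can be compared with a direct count over the augmented symmetry ensemble. The sketch below assumes a fully expanded ensemble; the function names `p_future_ones` and `p_future_ones_brute` are introduced here for illustration only. The brute-force version permutes all $n + N$ labeled observations and keeps the strings with exactly $K$ ones in the last $N$ positions.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def p_future_ones(K, N, k, n):
    """Closed form (1.10): binomial(n,k) * binomial(N,K) / binomial(n+N, k+K)."""
    return Fraction(comb(n, k) * comb(N, K), comb(n + N, k + K))


def p_future_ones_brute(K, N, k, n):
    """Relative frequency counted over all permutations of the augmented sequence."""
    labels = [1] * (k + K) + [0] * (n - k + N - K)
    observations = list(enumerate(labels))  # indices keep equal labels distinguishable
    total = 0
    favourable = 0
    for sequence in permutations(observations):
        total += 1
        # Count the 1's falling in the last N positions (the not yet observed suffix).
        if sum(label for _, label in sequence[n:]) == K:
            favourable += 1
    return Fraction(favourable, total)


# Small case: n = 3 observations with k = 1 one, N = 2 future bits, K = 1 future one.
print(p_future_ones(1, 2, 1, 3), p_future_ones_brute(1, 2, 1, 3))
```

With $N = 1$ and $K = 1$ the same function reproduces (1.3), which is a convenient sanity check on the reconstruction.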


In order to reduce the cardinality of the parameters’ set, and thus the knowledge
necessary to give output to our formula, we will apply the following approximations:

1. Let us consider the situation after many years of your using the mailbox,
so long a time that you have accumulated a long list of e-mails. With
this length $n$ increasing as you wish (the mathematician says with $n$ going
to infinity and denotes it $n \to +\infty$) you don’t care about any additional
fixed number $c$ of new messages in your inbox. Then we can write:

$$P(1) \simeq \frac{k}{n} \tag{1.11}$$

$$P(0) \simeq 1 - \frac{k}{n} \tag{1.12}$$

$$P(0,1) \simeq \left(1 - \frac{k}{n}\right)\frac{k}{n} \tag{1.13}$$

$$P(1,0,1) \simeq \left(\frac{k}{n}\right)^{2}\left(1 - \frac{k}{n}\right) \tag{1.14}$$

$$P(K; N, k, n) \simeq \binom{N}{K}\left(\frac{k}{n}\right)^{K}\left(1 - \frac{k}{n}\right)^{N-K} \tag{1.15}$$

where $x \simeq y$ denotes the fact that $x$ differs from $y$ by a quantity going to
0 with $n \to +\infty$.




There are four advantages to this limit:
a. writing the above formulas as a function of the ratio $\frac{k}{n}$, thus reducing
by one unit the number of involved parameters;
b. having this parameter coincide with $P(1)$;
c. getting $P(0) + P(1) = 1$;
d. expressing the probability of a sequence of events through the product
of the probabilities of the single events.

The latter benefit, which in the real strings holds only approximately
because the strings are finite, is usually denoted as independence between
events.

Definition 1.6. Two events $A$ and $B$ belonging to the same outcome space
are said to be independent if $P(A \cap B) = P(A)P(B)$.


Remark 1.3. Independence is the asymptotic counterpart of the full ex- 
pansion of the symmetry ensemble of a string s. It derives from the fact 
that we have no reason to discriminate between permutations of the data 
in s, i.e. no special structure exists connecting the data to their position 
in the string. This is precisely the difference between examples 1.1 and 
1.2. 


2. Let us push toward infinity the length of the future data string. After some
algebra on the central term in (1.15), or rather on its equivalent expression

$$\binom{n}{k}\left(\frac{K}{N}\right)^{k}\left(1 - \frac{K}{N}\right)^{n-k} \tag{1.16}$$

derived from (1.10) with $K$ and $N$ going to infinity, we have ¹³ that the
most probable value of $K$ is such that

$$\frac{K}{N} \simeq \frac{k}{n} \tag{1.17}$$

Moreover, as we will see in Sec. 2.3.2.4, the probability of this value tends
to 1 with $N$ as well. Finally, given the symmetry of the two expressions


¹³A very broad way is to consider the first and second derivatives w.r.t. $K$, approximated by
a continuous variable, of the logarithm of $\binom{n}{k}\left(\frac{K}{N}\right)^{k}\left(1 - \frac{K}{N}\right)^{n-k}$ as the limit for $N$ and $K$
going to infinity, respectively $\frac{k}{K} - \frac{n-k}{N-K}$ and $-\frac{k}{K^2} - \frac{n-k}{(N-K)^2}$, and realize that $K$ as in (1.17)
nullifies the former expression and that the second derivative is always lower than 0 in this
point. See also Example 2.38 for a more rigorous proof of this statement.




(1.15) and (1.16) ¹⁴, we have that with $n$ and $N$ going to infinity the two
ratios $\frac{K}{N}$ and $\frac{k}{n}$ tend to the same value $p$ if it exists. Namely, we assume $p$
to be the limit of the rate $\phi$ of 1’s in the string that we are observing, as a
consequence of the fact that we are actually observing the outputs of the
same phenomenon, and say that the analogous rate $\Phi$ in the infinite suffix
of it tends in probability to $p$. To distinguish the two limits we denote
the former with $\phi \to p$ and the latter with $\Phi \xrightarrow{P} p$ (see Definition 2.14).
The notations say that from among the almost infinitely long strings in
the augmented symmetry ensemble the number of those with $|\Phi - p| > \varepsilon$
becomes negligible for any $\varepsilon$ with the length of the not yet observed
string going to infinity if $|\phi - p|$ becomes less than $\varepsilon$ as well. We can say
that $p$ is a constant of the phenomenon we are observing, representing the
asymptotic frequency of 1’s in every long sequence of observations of it.
As a limit value we cannot know it and in any case will not exploit it in
the real life limited sequences. However, it proves to be an extremely useful
parameter from which the canonical probability theories start for
describing the behavior of the data we will observe.
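Both approximations can be observed numerically. The sketch below is an illustration under assumed values (a ratio $k/n = 3/10$, $N = 5$, $K = 2$ for the first limit and $n = 10$, $k = 3$, $N = 200$ for the second); the function names are introduced here and are not part of the text. It compares the exact expression (1.10) with its binomial approximation (1.15) for growing $n$, and then locates the most probable $K$ for a long future string.

```python
from fractions import Fraction
from math import comb


def p_exact(K, N, k, n):
    """Exact frequency (1.10) on the augmented symmetry ensemble."""
    return Fraction(comb(n, k) * comb(N, K), comb(n + N, k + K))


def p_binomial(K, N, ratio):
    """Approximation (1.15), where ratio plays the role of k/n."""
    return comb(N, K) * ratio**K * (1 - ratio)**(N - K)


def most_probable_K(N, k, n):
    """Value of K maximizing (1.10) for fixed N, k and n."""
    return max(range(N + 1), key=lambda K: p_exact(K, N, k, n))


ratio = Fraction(3, 10)
for n in (10, 100, 1000, 10000):
    k = 3 * n // 10  # keeps k/n fixed at the assumed ratio
    gap = abs(p_exact(2, 5, k, n) - p_binomial(2, 5, ratio))
    print(n, float(gap))  # the gap shrinks as n grows

# With a long future string the most probable K/N sits close to k/n, as in (1.17).
print(most_probable_K(200, 3, 10) / 200, 3 / 10)
```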


1.2.2 Kolmogorov axioms of probability 


Pushing $n$ and $k$ to infinity, i.e. having acquired extensive experience about
how your mailbox behaves, we learn from relation (1.11) that the experiment
of observing the $(n+1)$-th bit of our sequence has the same probability of
showing a 1 as that of finding a 1 in an arbitrary position, e.g. the $j$-th, within
the already observed bits. In the latter case we have indeed the augmented
symmetry ensemble coinciding with the symmetry ensemble of the original
sequence (i.e. $n$ bits with $k$ 1’s), so that

$$P(1) = \frac{k\,(n-1)!}{n!} = \frac{k}{n} \tag{1.18}$$

The defect of this experiment is that it does not tell us anything new, since it
refers to already observed data. Its value lies in the fact that the probability
expression holds also for $n$ and $k$ small.

To remove the mentioned drawback we introduce the random experiment,
which consists in selecting at random one of the observed bits and observing
it again. Thus the result of the random experiment is new, too, and therefore
interesting. From a dual perspective, the lack of information (which we suffer in
our computational context in terms of absence of procedures able to discriminate
between different sequences of a symmetry ensemble) is voluntarily induced
here in the experiment to preserve its novelty. There is no computing machinery
that can create this novelty by definition if our operational context is boundlessly
powerful ¹⁵. In more feasible computational contexts, novelty is induced by a
bit selection procedure that is difficult to invert, so that only people familiar
with the procedure and not those who have simply observed the previous bits


¹⁴This symmetry is already evinced by (1.10).
¹⁵Where this power lies also in the availability of unbounded resources in terms of time and
space.




can forecast the new bit (which is another way of tackling the limitations of
our knowledge at the basis of our approach to probability theory). This is what
happens for instance in any lottery where careful rules bar the player from any
real information about the numbers that will be extracted. In some cases a
mechanism very hard to invert is introduced, as in having a blindfolded child
extract the numbers ¹⁶. But a good computer program, like in electronic slot
machines, is sufficient for achieving the same purpose. We will see it in Sec.
1.3.6 when discussing simulation methods. In this scenario we can attribute
directly to the random experiment, i.e. to the single generated bit, the property
of being 1 with a given probability, a parameter we saw not changing when
passing from one infinite string of data to another. Finally, this is also the
probability that 1 appears as the next result of the random experiment. This
framework is captured well and ruled in its entire generality by the following
Kolmogorov axioms.


Definition 1.7. A probability space is a triple $(\Omega, \Sigma, P)$ where $\Omega$ is a non empty
set, $\Sigma$ a sigma-algebra on it and $P$ a probability measure represented by a
function from $\Sigma$ to the unitary interval $[0,1]$ such that:

• $P(A) \geq 0$ whenever $A \in \Sigma$ ($P$ is a measure on the event algebra $\Sigma$ ...);

• $P(\Omega) = 1$ (... whose unit is the measure of the outcome space $\Omega$, i.e. the
certain event ...);

• $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$ whenever $A_i \in \Sigma$ and $A_i \cap A_j = \emptyset$ for each
$i, j \in \{1, \ldots, n\}$ with $i \neq j$ and $n$ possibly going to infinity (... and is an
extensive measure).


Within the Kolmogorov framework we get the probability (1.18) of selecting
at random a 1 within the $n$ bits starting from the assumption that each bit $B_i$
has the same probability of being picked. As $\sum_{i=1}^{n} P(B_i) = P(\Omega)$, by the third
axiom, and therefore $\sum_{i=1}^{n} P(B_i) = 1$, by the second one, $P(B_i) = 1/n$ for each
$i$. Still by the third axiom $P(\text{“extracting a 1”}) = \sum_{j} P(B_j)$, where the $j$’s are the
indices of the bits set to 1, which makes $k/n$ as in (1.18). Thus the probability
measure therein defined inherits from the combinatorial framework the meaning
of an asymptotic property of the experiment, which is usually referred to with
the pre-scientific notion of attitude to occurring of the indicated events. We
have many ways of getting to grips with this attitude. For instance we may consult
a diviner (out of any scientific scope); we can rely on our experience with the
single event (the typical approach of Fuzzy Sets theory [Cox, 1998]); we can summarize
our complete experience with a given random phenomenon, which allows us to
fix the three ingredients $(\Omega, \Sigma, P)$ of a probability space as in the scope of this
book.
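The derivation of $k/n$ from the axioms, and the random experiment it describes, can be sketched as follows. This is an illustrative example only: the bit string, the number of trials and the seed are assumptions, and `probability_of_one` and `random_experiment_frequency` are names introduced here. The first function applies the second and third axioms exactly; the second one re-observes bits picked at random with Python's pseudo-random generator.

```python
from fractions import Fraction
import random


def probability_of_one(bits):
    """P("extracting a 1") as the sum of P(B_j) = 1/n over the bits set to 1."""
    n = len(bits)
    p_single = Fraction(1, n)  # equal probabilities summing to P(Omega) = 1
    return sum((p_single for b in bits if b == 1), Fraction(0))


def random_experiment_frequency(bits, trials, seed=0):
    """Repeat the random experiment: pick one observed bit at random and look at it."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for reproducibility
    hits = sum(rng.choice(bits) for _ in range(trials))
    return hits / trials


observed = [1, 0, 1, 0, 0]  # hypothetical observed string with n = 5 and k = 2
print(probability_of_one(observed))
print(random_experiment_frequency(observed, 20000))
```

The simulated frequency fluctuates around the exact value, in line with the asymptotic reading of $P$ given above.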


¹⁶Like in the Italian lotto.




1.2.3 Exploiting knowledge to build probability models 


A probability model is any specification of the boiling up probabilities. In par- 
ticular the former specification of the probability space stands for the equiprob- 
able model, which is the true root of any other model via the simple mnemonic 
rule that: the probability P(A) of any event A in the sigma-algebra over the 
equiprobable space is given by: 


$$P(A) = \frac{\text{number of favorable elementary events}}{\text{total number of elementary events}} \tag{1.19}$$


The power of the axiomatic approach lies in the fact that we can discuss a
probabilistic model fully, deducing theorems and operational decisions, totally
ignoring the physical or conceptual experiment at its basis. Thus the same rule
(1.19) applies for instance when we study the outcomes of a die cast or any other
game of chance for which we can assume an equiprobable outcome space ¹⁷.

Using the third axiom we may build new models by adding further
specifications to a current probability space as a counterpart of the shifting from
a simpler to a more synthetic dictionary interpreting the signal we receive.
This happens when we have a clearer idea of the scope of our observations
on a phenomenon. Returning for instance to the bet of casting an odd
number with a fair die, we move to a new equiprobable model where $\Omega$
is made of the sole outcomes “success”, “unsuccess”, with P(“success”) =
P(“the outcome is 1”) + P(“the outcome is 3”) + P(“the outcome is 5”) = 1/6 +
1/6 + 1/6 (see footnote 17) = 1/2, and P(“unsuccess”) = 1/2 for similar reasons.
The space is no longer equiprobable if we bet on a prime ¹⁸ number {1,2,3,5}
outcome. In this case P(“success”) = 2/3 and P(“unsuccess”) = 1/3. In
principle we may work with an outcome space where each elementary event cannot
be attributed a reasonable non vanishing probability. This happens for instance
when, shooting a target, we deal with an $\Omega$ where any point around the mark
is an elementary event, but no continuous assignment of non vanishing
probabilities to the events satisfies the second axiom. Discarding any philosophical
argumentation, rather driven by the common practice of processing data by a
computer, we bypass this scenario by assuming henceforth we are dealing only
with enumerable outcome spaces ¹⁹. In this frame the basic operation for
conceiving a probability space is the following experiment:


Definition 1.8. Given a probability space $(\Omega, \Sigma, P)$ a probabilistic experiment is
a production of an infinite sequence of elementary events $e_i \in \Omega$ with
probabilities $p_i$ described by $P$.
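The passage from the equiprobable die to the coarser betting models discussed above can be sketched with rule (1.19). The helper names `equiprobable_probability` and `coarse_model` are assumptions introduced for this illustration; the two bets are the ones mentioned in the text (an odd outcome, and a prime outcome with the convention that 1 counts as prime).

```python
from fractions import Fraction


def equiprobable_probability(event, omega):
    """Rule (1.19): favorable elementary events over total elementary events."""
    omega = set(omega)
    return Fraction(len(set(event) & omega), len(omega))


def coarse_model(success_event, omega):
    """Two-outcome model obtained by compounding elementary events of a fair die."""
    p_success = equiprobable_probability(success_event, omega)
    return {"success": p_success, "unsuccess": 1 - p_success}


die = {1, 2, 3, 4, 5, 6}
print(coarse_model({1, 3, 5}, die))      # bet on an odd outcome
print(coarse_model({1, 2, 3, 5}, die))   # bet on a "prime" outcome, 1 included
```

Once the two-outcome dictionary is built, later computations only need its two probabilities, not the six faces they were derived from.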


¹⁷The probability of any face of a fair die equals 1/6, of any roulette number equals 1/37
in Europe and 1/38 in USA, etc.

¹⁸With the old-fashioned convention of considering 1 to be prime.

¹⁹That is to say that we will describe the coordinates of the hit points with a bounded number
of bits.




Remark 1.4. As you may realize, we use the word experiment to denote
operations on sequences of elementary events. In particular:

• an experiment is a family of sequences having a given prefix, or their
generating procedure by abuse,

• a random experiment picks at random one element in a sequence, and

• a probabilistic experiment produces an infinite sequence according to given
probabilities.

Suitably combining elementary events we obtain new events that may figure
as elementary events of a new probability space. Once we have moved to the new
model, we can forget about the original one for doing any further computation.
Thus nothing of the previous model will help us avoid ending up on the rocks
because of the dice. Indeed, all statements we can make about a phenomenon
described by a given probability space may be computed exclusively by an
iterated application of the three Kolmogorov axioms. Some of the usual subroutines
in this procedure are represented by the theorems listed in Table 1.5. The
rationale of these subroutines is generally trivial. We detail only the less intuitive
ones.


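Theorem 1.2. $P(\emptyset) = 0$.
Proof.

$$\begin{aligned}
P(\emptyset) + P(\Omega) &= P(\emptyset \cup \Omega) && \text{(since } \emptyset \cap \Omega = \emptyset\text{)} \\
&= P(\Omega) && \text{(since } \emptyset \cup \Omega = \Omega\text{)}
\end{aligned}$$

ergo $P(\emptyset) = 0$. □

Theorem 1.3. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Proof. If $A \cap B = \emptyset$ then the claim holds by the third axiom and Theorem 1.2.
Otherwise, $A - A \cap B$, $A \cap B$ and $B - A \cap B$ are disjoint sets ²⁰ and their union
is $A \cup B$. Therefore

$$\begin{aligned}
P(A \cup B) &= P(A - A \cap B) + P(A \cap B) + P(B - A \cap B) \\
&= P(A) - P(A \cap B) + P(A \cap B) + P(B) - P(A \cap B) \\
&= P(A) + P(B) - P(A \cap B)
\end{aligned}$$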



Table 1.5: Some corollaries of the Kolmogorov axioms. 


Facing a random phenomenon such as the message history of our mailbox, we
try to describe it through a probability model. Its design generally comes from
both past observations (messages about houses for sale with a given frequency,
no messages from the National Library, etc.) and our own conceptualization of
the phenomenon (no messages longer than 100 kilobytes because my provider
does not allow it, and so on). Its value depends on the suitability of the
operational decisions we take about it. A professional skill in designing a model is
to involve a very low number of parameters. This gives the twofold benefit of
a high manageability of the model, in terms of adaptability to and checkability of
the agreement with the phenomenon at hand, and a great extensibility to other
phenomena.

Besides the mentioned equiprobable model, we implicitly enunciated two
further models and related experiments, namely


Definition 1.9 (Bernoulli model). Given $p \in [0,1]$, the model is defined as
follows:

• outcome space $\Omega = \{e, e^c\}$;
• event algebra $\Sigma = \{\emptyset, \{e\}, \{e^c\}, \{e, e^c\}\}$;
• probability measure $P$ defined by

$$P(u) = \begin{cases} p & \text{if } u = e \\ 1 - p & \text{if } u = e^c \end{cases} \tag{1.20}$$

A Bernoulli experiment is a probabilistic experiment with two sole possible
outcomes { “success”, “unsuccess” }.

Definition 1.10 (Binomial model). Given $n \in \mathbb{N}$ and $p \in [0,1]$, the model is
defined as follows:

• outcome space $\Omega = \{e_0, e_1, \ldots, e_n\}$;


²⁰The intersection operator is evaluated before the subtraction one, according to the usual
precedence rules for set operators.




• event algebra $\Sigma = 2^{\Omega}$;

• probability distribution $P$ defined by

$$P(e_i) = \binom{n}{i} p^i (1-p)^{n-i} \tag{1.21}$$

With $e_i = i$ for each $i$, we interpret this model through a binomial experiment,
that is a probabilistic experiment consisting in the execution of $n$ independent
Bernoulli experiments with the same parameter $p$. On each $n$-tuple of Bernoulli outcomes
the binomial experiment's outcome counts the number of successes that occurred.


Using the notation of the Bernoulli model for describing the experiment, thanks
to the definition of independence (Definition 1.6), we have that the probability of the
string

$$(\underbrace{e, e, \ldots, e}_{k \text{ times}}, \underbrace{e^c, e^c, \ldots, e^c}_{n-k \text{ times}})$$

equals $p^k (1-p)^{n-k}$, which makes the probability of having any one among the
$\binom{n}{k}$ strings with exactly $k$ successes and $n-k$ unsuccesses as in (1.21).
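
The counting argument above can be checked numerically. The following is a minimal sketch, not taken from the book: the function names and the values n = 6, p = 0.3 are illustrative assumptions. It sums the probability of every 0/1 string with exactly k ones and compares the total with the closed form (1.21).

```python
from itertools import product
from math import comb, isclose

def binomial_pmf(n, p, k):
    """Closed form (1.21): probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pmf_by_enumeration(n, p, k):
    """Sum p^k (1-p)^(n-k) over all 0/1 strings of length n with exactly k ones."""
    total = 0.0
    for s in product((0, 1), repeat=n):
        if sum(s) == k:
            total += p**k * (1 - p)**(n - k)
    return total

n, p = 6, 0.3  # illustrative values, not from the text
for k in range(n + 1):
    assert isclose(binomial_pmf(n, p, k), pmf_by_enumeration(n, p, k))
```

The enumeration visits all 2^n strings, so it is only meant for small n; the closed form is what one would use in practice.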

Thus we recover the result (1.15) just by passing from one experiment to the 
other without any reference to the operational context of our mailbox where 
we started the whole matter. Most of the probabilistic models we will consider 
in the following are an approximation of exactly the binomial model. In turn, 
the latter figures as the approximation of an almost binomial model where the 
independence between the Bernoulli experiments is not applicable. Namely, 
consider a special Bernoulli experiment where the success consists in drawing a 
red ball from an urn of N equiprobable balls K of which are red. Then assume 
that during the n repetitions of the experiment the drawn balls are not put back 
in the urn. After some algebra (see Appendix A) and (1.19) we realize that this 
specification gives rise to the following model. 


Definition 1.11 (Hypergeometric model). For $n, N, K \in \mathbb{N}$ with $N$ greater than or
equal to both $n$ and $K$, the model is defined as follows:

• $\Omega$ and $\Sigma$ as for the binomial model;

• probability distribution $P$ defined by

$$P(e_i) = \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \tag{1.22}$$


It is logically evident and formally proven that for any $K$ the latter probability
distribution gets close to the binomial one of parameters $n$ and $p = K/N$ with $N$ increasing,
which allows us to deal with more manageable analytical forms.
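
As a hedged illustration of this closeness, the sketch below compares (1.22) with the binomial probabilities of parameter p = K/N for growing N. The helper names and the values n = 5, p = 0.3 are assumptions made only for this example.

```python
from math import comb

def hypergeom_pmf(N, K, n, i):
    """P(e_i) in (1.22): i red balls in n draws without replacement."""
    return comb(K, i) * comb(N - K, n - i) / comb(N, n)

def binomial_pmf(n, p, i):
    """P(e_i) in (1.21): i successes in n independent draws."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def max_gap(N, p, n):
    """Largest pointwise difference between the two models for K = round(p N)."""
    K = round(p * N)
    return max(abs(hypergeom_pmf(N, K, n, i) - binomial_pmf(n, K / N, i))
               for i in range(n + 1))

n, p = 5, 0.3  # illustrative values
gaps = [max_gap(N, p, n) for N in (20, 200, 2000)]
print(gaps)  # the gap shrinks as the urn grows
```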




1.3 Random variables 


When the description of the events within an outcome space has an inherent 
order so that they may be gathered by a variable, it is useful to work with 
this variable for two reasons: 1. an easier description of relevant events and 
their measures, and 2. the identification of meaningful parameters that remain 
asymptotically constant along the observation history. 


Definition 1.12. For a given probability space $(\Omega, \Sigma, P)$, a random variable $X$ is
a bijective function $^{21}$ from the elementary events of $\Omega$ to an enumerable subset
of the real line $\mathbb{R}$ $^{22}$.


Notation 

We will denote with capital letters (e.g. X) a random variable and with the 
corresponding lower case letter (e.g. x) a value that can be assumed by the 
variable. We call x a specification of X. By extension, we are also inclined 
to use capital letters to denote free parameters referring to a population (for 
instance N for its size) and lower case letters to denote ones referring to a given 
prefix (for instance n for its size). 


Thus $X$ is a concise dictionary for describing the events of $\Omega$ with the sole
constraint that we can attribute a probability to the event $(X = x)$, for each
$x \in D_X$, where $D_X$ denotes the set of values we consider $X$ may assume. This
means that for each such $x$ an elementary event $e$ in $\Omega$ is associated to the event
$(X = x)$ with $P(X = x) = P(e)$ and $D_X$ coinciding with the outcome space. We
read $(X = x)$ and $(X \le x)$ as the events: "A specification of $X$ we may observe
exactly equals $x$" and "A specification of $X$ we may observe is less than or equal to $x$",
respectively. The above constraint is satisfied also if we map from a partition
of $\Omega$ to $\mathbb{R}$, which means however that the mapping is from another outcome
space having the elements of the partition as elementary events. Suppose for
instance that in a dice game the bets are related to the fact that the result is
an odd or an even number. In this case the outcome space {"odd", "even"} is
more suitable than the original one $\{1, 2, 3, 4, 5, 6\}$.

Conversely, we move to a universal outcome space given by the real line $\mathbb{R}$ by
default and base our reasoning on the constancy of the asymptotic frequencies
with which a given random variable $X$ takes values on this space (for short,
$X \in \mathbb{R}$). The probability distribution on it is now a function of the specifications
$x$ benefiting from the common mathematical abilities of giving values to the
functions or accumulating them on special subsets of $\mathbb{R}$. As $\Omega$ is enumerable
this function takes values different from zero in a discrete set of points that

$^{21}$ A function $f : X \to Y$ is bijective if for every $y$ in the codomain $Y$ there is exactly one $x$
in the domain $X$ such that $f(x) = y$.

$^{22}$ This is a way of defining a random variable simpler than in typical probability books. It
is allowed by our hypothesis on the enumerability of the outcome space. A less demanding
definition refers to any function $X$ from events in $\Omega$, provided that the set $\{A : X(A) \le r\}$
belongs to $\Sigma$ for any $r \in \mathbb{R}$.




Fig. 1.1: Graph of the cumulative probability distribution of: (a) $X$ uniform in
$\{1, \ldots, m\}$, and (b) companion $\tilde{X}$ in $[0, m]$, for $m = 10$.


we call mass points, having probability masses summing to 1. We associate all
the remaining points to $\emptyset$, i.e. the impossible event. In this sense we speak of
$X$ as a discrete random variable. However, for the sake of simplicity, we often
approximate $X$ with a continuous variable $\tilde{X}$ by spreading the probability masses
along the gaps between the discrete values. In this case no event $(\tilde{X} = x)$ has a
probability greater than 0. But we can arrange the spreading so that the pair of
events $(X = x'')$ and $(x' < \tilde{X} \le x'')$ (i.e. $\tilde{X} = x''$ except disregardable details)
have the same probability for each pair of consecutive mass points $x'$ and $x''$ of $X$ $^{23}$.
Consequently, the two events $(X \le x')$ and $(\tilde{X} \le x')$ have the same probability
as well. For instance, we do this with a variable $X$ assuming integer values from 1
to $m$ with equal probability as in Fig. 1.1(a) by defining a companion continuous
$\tilde{X}$ as in Fig. 1.1(b). It ranges from 0 to $m$ and has an equal probability of
assuming values within any pair of equal width intervals. Consequently
$P(i-1 < \tilde{X} \le i) = P(X = i) = 1/m$ and the probability of the event $(\tilde{X} \le x)$ is a linear
function of $x$, analytically simpler than the analogous function for $X$.
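
A small sketch may help to see the two cumulative functions side by side. The function names are hypothetical and m = 10 is chosen to match Fig. 1.1; the check only confirms that the staircase and the linear companion agree at every mass point.

```python
import math

def cdf_discrete_uniform(x, m):
    """Probability of (X <= x) for X uniform on {1, ..., m}: a staircase."""
    return min(max(math.floor(x), 0), m) / m

def cdf_companion(x, m):
    """Same probability for the continuous companion on [0, m]: linear in x."""
    return min(max(x, 0.0), float(m)) / m

m = 10
for i in range(1, m + 1):
    # the two functions coincide at each mass point i
    assert cdf_discrete_uniform(i, m) == cdf_companion(i, m)
    # each unit interval carries the mass 1/m of the corresponding point
    width_mass = cdf_companion(i, m) - cdf_companion(i - 1, m)
    assert math.isclose(width_mass, 1 / m)
```

Between mass points the two functions differ (for instance at x = 3.7), which is the detail the continuous approximation gives up.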


1.3.1 Standard ways for describing random variables 


As mentioned before, by default a random variable may assume any value
between $-\infty$ and $+\infty$. Thus what specifies the variable are the probabilities with
which these values are assumed, i.e. what is called the distribution law of the
variable. Its most immediate description is pointwise, namely:

Definition 1.13. Given a random variable $X$, the probability function (p.f. for
short) $P_X(x)$ computes the probability of the event $(X = x)$. In formulas:

$$P_X(x) = P(X = x) \tag{1.23}$$


$^{23}$ With the obvious extension for the extreme values.




In this book we adopt the epistemological constraint that P is a computable 
function with no special mention to the computational resources required. This 
constraint, which matches very well with the computational learning framework 
of the book and in any case with most operational environments, allows a very 
pragmatic approach in the following. 

When the natural order of the numbers on the real line fits with the physics 
of the phenomenon under observation, a useful way of describing X is through 
the cumulative probability over intervals. 


Definition 1.14. Given a random variable $X$, the cumulative distribution function
(c.d.f. for short) $F_X(x)$ measures the probability of the event $(X \le x)$. In
formulas:

$$F_X(x) = P(X \le x) \tag{1.24}$$

The obvious relation between the two functions is stated by the following equality:

$$F_X(x) = \sum_{z \le x} P_X(z) \tag{1.25}$$


$F_X$ has a staircase shape with a step at each mass point (see Fig. 1.1(a)). As
mentioned before, we can decide to smooth its discontinuities into a continuous
function in order to deal with a more manageable analytical form. This has
the drawback of making the probability masses vanish, because of the continuity of
$F_X$, and the benefit of introducing a continuous function $f_X$ with the role of
probability density in place of $P_X$. Namely we define $^{24}$:

Definition 1.15. Given a random variable $X$ having a continuous cumulative
distribution function $F_X$, the density function (d.f. for short) $f_X$ is the derivative
of $F_X$. Namely:

$$f_X(x) = \frac{dF_X(x)}{dx} \tag{1.26}$$

The semantic counterpart of this definition gives $f_X(x)$ the meaning of a density
w.r.t. the probability measure $P$ on $X$. You could consider $f_X(x)$ as the
limit with $\Delta x$ going to 0 of the incremental ratio $P(x < X \le x + \Delta x)/\Delta x$
between the probability of having $X$ in a $\Delta x$ wide neighborhood of $x$ and this
width. We will use this interpretation simply as an evocative picture, evading
any mathematical subtlety and caveat coming from the application of measure
theory to infinitesimal items $^{25}$.


$^{24}$ With our usual pragmatic conciseness.

$^{25}$ In a way that will not cover some specific mathematical details related to the definition
of a probability density.




Table 1.6: E-mail message spam indexing executed by SpamAssassin.

EXCUSE_15        (0.5 pts)   Claims to be legitimate e-mail
CLICK_BELOW      (1.5 pts)   Asks you to click below
EMAIL_MARKETING  (0.3 pts)   Talks about e-mail marketing
WEB_BUGS         (-0.4 pts)  Image tag with an ID code to identify you


Remark 1.5. In this section we will normally maintain a different notation for 
p.f., with Px, and d.f., with fx, where the former is used by default, and the 
latter when expressly required by the continuous approximation of the random 
variable. Vice versa in the following chapters and in the appendices we will 
normally use fx to denote both p.f. and d.f. for homogeneity sake, distinguishing 
in the text between the two functions whenever necessary. 


Example 1.6. Returning to our lead example, usual spam sniffer systems in an
e-mail server are based on a grading of some indices affecting the message.
For instance, in Table 1.6 we report a set of such grading rules from the
SpamAssassin open-source mail filter (see http://www.spamassassin.org).

In this system the spam index of a message is given by the sum of the listed
values for those items that are applicable to the message. This index is a
discrete variable ranging from -1.2 to 14.1 with a step of at least 0.1. In the
case that each of the 12 features may appear or not in the message with the
same probability 1/2 independently of the others, the ranking of the messages
has the probability distribution as in Fig. 1.2(a), which we can comfortably
describe with a continuous function, i.e. with a continuous probability density
linked to the probability through the simple relation $^{26}$


$$P_X(x) = \int_{I(x)} f_X(z)\,dz \approx f_X(x) \cdot 0.1 \tag{1.27}$$

where $I(x)$ denotes an interval of width 0.1 containing $x$.


Of course in this approximation we lose details such as the zoom on the right 
tail of the above graph denoting for instance the impossibility of having a rank 
equal to 13.9 (see Fig. 1.2(b)). If we put a threshold on our spam sniffer, so 


$^{26}$ An equivalent link could be obtained by integrating $f_X$ over a different interval of width 0.1 containing $x$; the two relations are formally
different but coincide at the same order of approximation.




Fig. 1.2: Distribution of the score index attributed to a message according to Table 1.6 
in the equiprobability hypothesis for each index. (a) Original spam score distribution. 
(b) Score details in the right tail. (c) Mixed version distribution. 


that any rank greater than or equal to 8 collapses into the dummy rank 8, determining
the discarding of the message, the new random variable has a probability
distribution as in Fig. 1.2(c), which suggests a mixed representation: through a
density function before 8 and a probability mass, hence a probability function,
concentrated exactly on the mass point 8. There is nothing puzzling about this,
provided that $\int_{x < 8} f_X(x)\,dx + P_X(8) = 1$; it denotes a mixed distribution law.
The sole caution is for the fact that the ordinate of the mass point goes to
infinity if we move to the graph of $f_X$ (see comments to Fig. B.3).


1.3.2 Fixed properties in a sequence of variable specifications 


A second advantage of introducing a random variable is that we can fix other 
asymptotic parameters that remain constant during the probabilistic experiment 
of drawing specifications of the variable. We do this through the following 
operator. 


Definition 1.16 (Mean (value)). Given a random variable $X$ assuming specifications
with probability function $P_X$ within the set $D_X$, the mean $\mu_X$ of $X$ is
computed as:

$$\mu_X = \sum_{x \in D_X} x P_X(x) \tag{1.28}$$


Consider the time you waste opening unsolicited e-mail messages. On each letter
you will spend more or less time depending on the number of words you must
read before understanding the purpose of the letter, the style of the sentences,
etc. Summarizing this time with the random variable $X$, whose specifications are
for instance discretely measured in seconds (thus $X \in \mathbb{N}$), you rely on the fact



that $P_X(3) = 0.4$ means that in a long sequence of unsolicited messages around
40% waste 3 seconds of your time each. You may have similar information
for any other reading time, but probably don't need such detailed knowledge.
Wondering about the total time you will waste in screening the next $m$ useless
messages you will receive in the next month, you are just interested in the value of
the variable

$$x_{tot} = \sum_{i=1}^{m} x_i \tag{1.29}$$

where $x_i$ is the specification of $X$ in correspondence to the $i$-th of the $m$ useless
messages you will screen. Now, let us rewrite the above expression as follows:

$$x_{tot} = m \left( \frac{1}{m} \sum_{i=1}^{m} x_i \right) = m \sum_{j=0}^{x_{max}} j\,\phi(j) \tag{1.30}$$

where $\phi(j)$ is the relative frequency with which you will meet messages causing a
waste $x = j$ over the next month history and $x_{max}$ the maximum of these values.
As $\phi(j) = P_X(j)$ when $m$ goes to infinity and the related symmetry ensemble is
fully expanded (see end of Sec. 1.2.1, and Remark 1.7 and Theorem 2.4 later on),
in this limit the desired quantity $x_{tot}$ can be computed as $m$ times the mean value
of $X$. Now, if you receive only 1,000 messages next month then you may assume
that $x_{tot} \approx m\mu_X$, and if the messages are 5,000 the proximity in percentage $^{27}$
of the two quantities should be better, and so on. The operational benefit of
using the mean value in place of computing the sum of the $x_i$ is evident. If you
need $x_{tot}$ before receiving the messages then the above sum cannot be computed
(prediction problem). But even if you want this value for a subsequent balance,
the sum comes at a cost. As a constant parameter of $X$, $\mu_X$ can be computed
once and used whenever it is requested. As a matter of fact we approximate $\mu_X$
taking $\bar{x}_\nu = \frac{1}{\nu} \sum_{i=1}^{\nu} x_i$ for very large $\nu$ and approximate $\bar{x}_m = \frac{1}{m} \sum_{i=1}^{m} x_i$
with $\mu_X$ with an accuracy that depends on the size of $m$ $^{28}$. The course of the
accuracy increase with $m$ is a key matter that we will study in this book in
order to get reliability for the operational decisions we will take about $X$. For
fixed $\mu_X$, the total time to be wasted in the next $m$ messages depends on the
messages that actually come. Thus it is a specification of a random variable
$X_{tot}$ as a function of the random variables $\{X_1, \ldots, X_m\}$. We will consider the
detailed analytical form of its distribution law later. In Fig. 1.3 we show the
course of $\bar{X}_m = X_{tot}/m$ with $m$ by comparatively considering its distribution
around $\mu_X$ for some values of $m$ and above hypotheses on relationships between
observed variables.
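
The behaviour described above can be explored with a short simulation. This is only a sketch under stated assumptions: the reading-time distribution `p_x` is hypothetical (only the value 0.4 for 3 seconds comes from the text), and independent draws from it stand in for the message history.

```python
import random

def sample_mean(xs):
    """Arithmetic mean of the observed specifications."""
    return sum(xs) / len(xs)

# Hypothetical probability function of the reading time in seconds;
# only P_X(3) = 0.4 is taken from the text.
p_x = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}
mu_x = sum(x * p for x, p in p_x.items())  # mean as in (1.28)

rng = random.Random(0)  # fixed seed so that runs are repeatable
values, weights = zip(*p_x.items())
for m in (10, 1000, 100000):
    xs = rng.choices(values, weights=weights, k=m)
    x_tot = sum(xs)
    # total time, its prediction m * mu_x, and the gap between the two means
    print(m, x_tot, m * mu_x, abs(sample_mean(xs) - mu_x))
```

The gap between `x_tot` and `m * mu_x` need not shrink in absolute terms, while the gap between the sample mean and `mu_x` typically does, in line with the two footnotes attached to this passage.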


Example 1.7. Consider a random variable $X$ distributed according to a
Bernoulli law of parameter $p$. From (1.28) we have

$$\mu_X = 0 \cdot (1-p) + 1 \cdot p = p \tag{1.31}$$

$^{27}$ We cannot expect to reduce the difference $|\sum_{i=1}^{m} x_i - m\mu_X|$ by increasing the variability
of the sum with new addends; but we enjoy the fact that this difference grows more slowly
than the sum $x_{tot}$.

$^{28}$ We may expect to reduce the difference $|\sum_{i=1}^{m} x_i/m - \mu_X|$, indeed.




Fig. 1.3: Approaching of $\bar{X}_m$ to $\mu_X$ with increasing $m$. Curves: piecewise lines
connecting the values of the $\bar{X}_m$ probability function, with $X$ following a Bernoulli
distribution with $\mu_X = 1/2$. Curve parameter: $m$, increasing from 10 (the flattest curve) to
100 (the most peaked curve) with step 10.


so that the mean and the distribution parameter coincide.
If $X$ follows a binomial distribution of parameters $n \in \mathbb{N}$ and $p \in [0,1]$, using
again (1.28) we get

$$\mu_X = \sum_{i=0}^{n} i \binom{n}{i} p^i (1-p)^{n-i} = np \sum_{j=0}^{n-1} \binom{n-1}{j} p^j (1-p)^{(n-1)-j} = np \tag{1.32}$$

where the last equality derives from the Newton expansion
$(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}$, or directly from the fact that
$\sum_{j=0}^{n-1} \binom{n-1}{j} p^j (1-p)^{(n-1)-j}$ is
the cumulative distribution $F_Y(n-1)$ of a binomial variable $Y$ of parameters
$n-1$ and $p$. Hence this sum equals 1. The mean of $X$ coincides therefore
with the product of its parameters.

Finally, consider the case of $X$ following a continuous uniform distribution
over the set $[a,b]$ (see page 342 in Appendix B for its formal definition). Since
this distribution is continuous, (1.28) specializes in substituting an integral for
the sum and the probability density for the probability function, as follows

$$\mu_X = \int_{-\infty}^{+\infty} x f_X(x)\,dx \tag{1.33}$$




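Therefore

$$\mu_X = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b = \frac{1}{b-a}\,\frac{b^2-a^2}{2} = \frac{1}{b-a}\,\frac{(a+b)(b-a)}{2} = \frac{a+b}{2} \tag{1.34}$$

Thus the mean value of $X$ is the midpoint of the interval $[a,b]$.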
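
The three means of Example 1.7 can be cross-checked numerically. The sketch below is illustrative only: the helper names and parameter values are assumptions, and the uniform case uses a plain midpoint rule in place of the exact integral.

```python
from math import comb, isclose

def mean_discrete(pmf):
    """Mean (1.28) of a discrete variable given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def bernoulli(p):
    return {0: 1 - p, 1: p}

def binomial(n, p):
    return {i: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}

def mean_uniform(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of x / (b - a) over [a, b]."""
    h = (b - a) / steps
    return sum((a + (k + 0.5) * h) / (b - a) * h for k in range(steps))

p, n, a, b = 0.3, 12, 2.0, 7.0  # illustrative values
assert isclose(mean_discrete(bernoulli(p)), p)         # mean equals p
assert isclose(mean_discrete(binomial(n, p)), n * p)   # mean equals n p
assert isclose(mean_uniform(a, b), (a + b) / 2)        # mean equals the midpoint
```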


The mean $\mu_X$ plays the role of equivalent value of the random variable, since
you can substitute with it all the various specifications you meet of the variable
if the final goal is to appreciate their sum. Then, for example, a city manager
needs the mean water consumption per family to fix the daily requirement of
water of a 1,000-family town or a server provider needs the mean number of
accesses requested by its clients to determine the server capacity, and so on.
There are lots of equivalent values you may want to know about a given variable.
For instance, the mean of the section of a bar with random diameter, or the
maximum of incomes in a group of corporate executives, etc. You get these
parameters in the same way provided you are able to figure out the values as
functions of the specifications of the random variable at hand. Namely, for
a given function $g$ $^{29}$ of the $X$ specifications you may rely on the asymptotic
constant:

Definition 1.17. Given a random variable $X$ assuming specifications with
probability function $P_X$ within the set $D_X$, and a function $g : D_X \to \mathbb{R}$, the
mathematical expectation $E[g(X)]$ of its application to $X$ is computed as:

$$E[g(X)] = \sum_{x \in D_X} g(x) P_X(x) \tag{1.35}$$


Thus the expectation operator makes a weighted sum of the values the function 
may assume with the specifications of its argument, where the weights are given 
by the probabilities with which the specifications are assumed. The general rule 
is that in this sum a greater contribution is given by the values we will meet 
more frequently. With g representing the identity function we realize that 


$$\mu_X = E[X] \tag{1.36}$$

That is the reason why we also call $\mu_X$ the expected value of $X$. The obvious
extension of these formulas to the case of continuous approximation of the random
variables is:


29 As usual we assume g a computable function by default. 




Definition 1.18. In case of continuous random variable $X$

$$E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x)\,dx \tag{1.37}$$


A nice feature of the mathematical expectation operator is the extreme simplic- 
ity of the companion operator by which we make an approximation of it from a 
series of observations. 

With reference to (1.30), observations of a given $X$ are a set $\{x_1, \ldots, x_m\}$ of
its specifications that we call sample of size $m$, and we simply take the arithmetic
mean of them in place of $E[X]$. More in general we rely on the fact that

$$E[g(X)] = \sum_{x \in D_X} g(x) P_X(x) \approx \frac{1}{m} \sum_{i=1}^{m} g(x_i) \tag{1.38}$$

which is based on the asymptotic convergence of the frequency with which a
specification $x$ occurred in a sample to its probability $P_X(x)$ for fully expanded
symmetry ensembles, and extends this convergence to any expected value. In
this perspective we give $P_X(x)$ the meaning of frequency of $x$ over an infinitely
large set of observations, denoted a population. The population represents $X$,
and a subset of it constitutes the sample. The simple rule for taking an approximate
companion of the expected value from a sample is: "give a same weight to
each observation you got from a phenomenon" $^{30}$. The approximating operator
is denoted as an estimator of $E[g(X)]$; its simplicity is rewarded by many nice
properties that we will study in Chapter 2.
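
To make the "same weight to each observation" rule concrete, here is a minimal sketch comparing (1.35) with its sample companion. The fair die, the function g(x) = x^2, the sample size and the seed are all assumptions chosen for illustration.

```python
import random

def expectation(g, pmf):
    """E[g(X)] as in (1.35), for a variable given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def estimate_expectation(g, sample):
    """Sample companion of (1.35): every observation gets the same weight 1/m."""
    return sum(g(x) for x in sample) / len(sample)

die = {x: 1 / 6 for x in range(1, 7)}  # hypothetical population: a fair die

def g(x):
    return x * x

rng = random.Random(42)  # fixed seed so that runs are repeatable
sample = [rng.randint(1, 6) for _ in range(50_000)]

exact = expectation(g, die)              # 91/6
approx = estimate_expectation(g, sample)
print(exact, approx)
```

Repeated values in the sample are simply counted once per occurrence, which is how the frequency of each specification enters the estimate.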

Considering the expected value of a variable corresponds to taking a picture
of the entire probability distribution of the variable. Multiplying the picture
perspectives allows us to discover more and more features of the distribution.
An exhaustive way of recovering with this strategy all the details of whatever
distribution consists in simultaneously taking the expected values of the powers
$X^r$ of the specifications for any $r \in \mathbb{N}$. Namely we define:

Definition 1.19. For a given random variable $X$ and observed sample
$\{x_1, \ldots, x_m\}$, the $r$-th moment $\mu_r$ of the variable is the expected value $E[X^r]$ of
the $r$-th power of $X$ $^{31}$:

$$\mu_r = E[X^r] \tag{1.39}$$


$^{30}$ Note the subtlety: each observation has the same probability, whatever the observation
content; the observation content has a probability approximated by the frequency of its
occurrence in the sample.

$^{31}$ As you may see, letter "m" is an overloaded symbol. In order to diminish the occasions of
misunderstanding, by default we adopt along the text the following notation: Greek symbol
$\mu$ for expectation over a given population, slant symbol $m$ for sample mean and regular lower
case m for sample size. Problems arise for the random version of the first two symbols, as
capital $\mu$ essentially coincides with capital $m$. To avoid confusion we will denote with M the
random variable having specification $\mu$ and with M the analogous variable for $m$.




Fig. 1.4: Comparison between the density functions of two uniform random variables 
Uı and U2 having the same mean E[Ui] = E[U2] = 3, marked by a vertical dashed 
line. (a) Black plot: V[Ui] = 4. (b) Gray plot: V[U2] = $. (c) Comparison between 
the densities. 


and the companion sample moment $m_r$ is:

$$m_r = \frac{1}{m} \sum_{i=1}^{m} x_i^r \tag{1.40}$$

It is obvious that with similar arguments we can rely on $m_r$ as an estimator of $\mu_r$.
The exhaustiveness of the representation of $P_X(x)$ through the set $\{\mu_1, \mu_2, \ldots\}$
for any power in the natural numbers comes from the following fact, whose proof
is omitted.

Fact 1.1. Consider two random variables $X$ and $Y$, and denote with $\mu_r^X$ and $\mu_r^Y$
the $r$-th moment of $X$ and $Y$, respectively. With some caveats $^{32}$, if the infinite
sequences $(\mu_r^X,\ r = 1, 2, \ldots)$ and $(\mu_r^Y,\ r = 1, 2, \ldots)$ coincide, then $X$ and $Y$
have the same distribution.

Besides the moment of order 1 that coincides with the mean of a random
variable, special information is given by the following parameter.

Definition 1.20 (Variance). Given a random variable $X$, the variance of $X$ is
computed as:

$$V[X] = \mu_2 - \mu_1^2 = E[(X - E[X])^2] \tag{1.41}$$


The variance is a dispersion index of the distribution of the X specifications 
around their mean value. The more a variable prefers specifications far from 
the mean the more its variance increases. See Fig. 1.4 to capture this concept 
visually. 


$^{32}$ In general, if both variables have a moment generating function [Billingsley, 1995].




Example 1.8. Coming back to Example 1.7, consider a random variable $X$
following a Bernoulli distribution of parameter $p$. According to the second
equality in (1.41) and recalling that $E[X] = p$, the variance of $X$ is

$$V[X] = (0-p)^2(1-p) + (1-p)^2 p = p^2(1-p) + p(1-p)^2 = p(1-p) \tag{1.42}$$

In case of $X$ distributed according to a binomial law of parameters $n \in \mathbb{N}$
and $p \in [0,1]$, we can compute the second moment of $X$ as

$$\begin{aligned} \mu_2 &= \sum_{i=0}^{n} i^2 \binom{n}{i} p^i (1-p)^{n-i} = np \sum_{j=0}^{n-1} (j+1) \binom{n-1}{j} p^j (1-p)^{(n-1)-j} \\ &= np \left( \sum_{j=0}^{n-1} j \binom{n-1}{j} p^j (1-p)^{(n-1)-j} + \sum_{j=0}^{n-1} \binom{n-1}{j} p^j (1-p)^{(n-1)-j} \right) \\ &= np((n-1)p + 1) = n(n-1)p^2 + np \end{aligned} \tag{1.43}$$

where

• the first equality has been derived through the same technique used for
obtaining the mean of $X$ in Example 1.7;

• the two terms in brackets in the central term amount respectively to the
mean of a binomial distribution of parameters $n-1$ and $p$ and to 1
according to the Newton expansion.

Applying the first equation in (1.41) and recalling that $E[X] = np$ we obtain

$$V[X] = n(n-1)p^2 + np - n^2p^2 = np(1-p) \tag{1.44}$$


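Finally, consider $X$ following a uniform distribution in $[a,b]$. The second
moment of $X$ is

$$\mu_2 = \int_a^b \frac{x^2}{b-a}\,dx = \frac{1}{b-a}\left[\frac{x^3}{3}\right]_a^b = \frac{1}{b-a}\,\frac{b^3-a^3}{3} = \frac{b^2+ab+a^2}{3} \tag{1.45}$$

so that applying the first equation in (1.41) and recalling that $E[X] = \frac{a+b}{2}$
yields

$$V[X] = \frac{b^2+ab+a^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12} \tag{1.46}$$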
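
The results of Example 1.8 lend themselves to a quick numerical check. The sketch below uses hypothetical helper names and illustrative parameter values; it confirms that the two forms of the variance in (1.41) agree for a binomial variable and that a midpoint-rule integral reproduces the uniform second moment and variance.

```python
from math import comb, isclose

def moment(pmf, r):
    """r-th moment (1.39) of a discrete variable given as {value: probability}."""
    return sum(x**r * p for x, p in pmf.items())

def variance_from_moments(pmf):
    """First form in (1.41): second moment minus squared first moment."""
    return moment(pmf, 2) - moment(pmf, 1) ** 2

def variance_central(pmf):
    """Second form in (1.41): expected squared deviation from the mean."""
    mu = moment(pmf, 1)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def binomial(n, p):
    return {i: comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)}

def uniform_second_moment(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of x^2 / (b - a) over [a, b]."""
    h = (b - a) / steps
    return sum((a + (k + 0.5) * h) ** 2 * h for k in range(steps)) / (b - a)

n, p = 12, 0.3  # illustrative values
pmf = binomial(n, p)
assert isclose(variance_from_moments(pmf), n * p * (1 - p))
assert isclose(variance_central(pmf), n * p * (1 - p))

a, b = 1.0, 5.0  # illustrative values
mu2 = uniform_second_moment(a, b)
assert isclose(mu2, (b * b + a * b + a * a) / 3, rel_tol=1e-8)
assert isclose(mu2 - ((a + b) / 2) ** 2, (b - a) ** 2 / 12, rel_tol=1e-6)
```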


A repeated application of relation (1.38) to the first order moment leads to 
the following definition of empirical cumulative distribution function: 


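Definition 1.21. For a given $\boldsymbol{x}_m = \{x_1, \ldots, x_m\} \subset \mathbb{R}$, the empirical cumulative
distribution function $\hat{F}_{\boldsymbol{x}_m} : \mathbb{R} \to [0,1]$ is defined as follows

$$\hat{F}_{\boldsymbol{x}_m}(x) = \frac{1}{m} \sum_{i=1}^{m} I_{(-\infty, x]}(x_i) \tag{1.47}$$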




where $I_B(b) = 1$ if $b \in B$ and 0 elsewhere is the indicator function of the set $B$.

Indeed, as $\boldsymbol{x}_m$ is the specification of a random sample of size $m$ drawn from
$X$, $\hat{F}_{\boldsymbol{x}_m}(x)$ converges to the cumulative distribution function $F_X(x)$ when $m$
increases to infinity, for the mentioned symmetry properties of the observations.
Globally, as the random variables summed up in (1.47) are $m$ Bernoulli variables of parameter
$F_X(x)$, it is possible to show that $^{33}$

$$P\left( \lim_{m \to +\infty} \sup_{x} \left| F_X(x) - \hat{F}_{\boldsymbol{X}_m}(x) \right| = 0 \right) = 1 \tag{1.48}$$

From Definition 1.21 we deduce that the graph of an empirical cumulative
distribution function is piecewise constant with discontinuity points
in correspondence to every sampled value. More precisely, the $i$-th discontinuity
point has ordinate $i/m$ and abscissa equal to the $i$-th sample item
once the sample has been sorted in nondecreasing order. For instance,
in Fig. 1.5(a) we report a function computed on the basis of the sample
{14.59, 8.96, 0.75, 1.99, 6.73, 2.58, 5.88, 3.39, 4.27, 0.84}. The empirical distribution
function is a widespread tool for having an idea of the shape of a random
variable distribution, the true (analytical) distribution function being the evolution
of this shape when the sample extends to the whole population. In Fig.
1.5 we may appreciate the convergence of the former to the latter function. We
artificially draw (i.e. simulate with the techniques that will be explained in Sec.
1.3.6) a sample of increasing size from a given distribution law $^{34}$ and compare
the graphs of the two functions. With a sample of 1,000 elements we detect
only minor differences between the curves, even though the empirical function
is discrete and the true one is continuous.
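
A minimal sketch of Definition 1.21 applied to the ten-item sample quoted above follows. The function name `ecdf` and the use of binary search on the sorted sample are implementation choices made for this example, not something prescribed by the text.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical cumulative distribution function of the sample."""
    xs = sorted(sample)
    m = len(xs)

    def F(x):
        # number of sample items less than or equal to x, divided by m
        return bisect_right(xs, x) / m

    return F

sample = [14.59, 8.96, 0.75, 1.99, 6.73, 2.58, 5.88, 3.39, 4.27, 0.84]
F = ecdf(sample)

# the i-th sorted item is a jump point where the function reaches i/m
for i, x in enumerate(sorted(sample), start=1):
    assert F(x) == i / len(sample)
```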


Remark 1.6. The non cumulative version of these diagrams, namely the
histograms, are employed in many circumstances: having grouped contiguous
values we report on the ordinate axis the frequencies of the groups. Figure 1.6
reports the companion histograms of Fig. 1.5 for a suitable choice of the grouping
intervals $^{35}$. Note that we report probabilities $p$ of the same grouping intervals
on the part of the true distribution since we cannot directly compare histogram
frequencies with probability densities.


Two further perspectives through which to highlight specific tendencies of 
the random variable lead to two parameters that we will subsequently use. 


$^{33}$ This result has been obtained by various researchers through different proof strategies;
however, the most common way of referring to it is as the Glivenko-Cantelli theorem [Glivenko,
1933].

$^{34}$ Namely, the negative exponential distribution with $\lambda = 0.2$ (see Appendix B).

$^{35}$ Which corresponds to discretizing the values.


Fig. 1.5: Convergence of the empirical ($\hat{F}_{\boldsymbol{x}_m}$, gray plot) to the analytical ($F_X$, black
curve) cumulative distribution function. Panels: (a) $m = 10$; (b) $m = 100$; (c) $m = 1000$.
$m$: sample size. $x$: variable realizations.

Fig. 1.6: Companion histograms of the empirical (dark gray) and analytical (light
gray) cumulative distributions in Fig. 1.5. $m$: sample size. Grouping intervals width:
(a) 3; (b) 2; (c) 0.5. $x$: sample realizations, $\phi$: realizations' frequency, $p$: interval
probability.




Fig. 1.7: Mode, Median and Mean (respectively, arrows from left to right) of the
negative exponential distribution with parameter $\lambda = 1$. Black curve: graph of the
related d.f.


Definition 1.22. Of a given random variable $X$:

• the median is the value $x$ such that $P(X > x) = P(X < x)$, with suitable
approximations taking into account possible asymmetries when $X$ is
discrete;

• the modal values are those $x$ where $f_X(x)$ is maximal. In the case where
there exists a unique maximum, we call mode the corresponding $x$ and
unimodal the distribution.


Example 1.9. Figure 1.7 shows the location of mean, mode and median in a 
negative exponential distribution (see Appendix B). 


While denoting very local characteristics of the $X$ distribution, these parameters
are often strongly connected to the mean of $X$. Namely:

Fact 1.2. If $X$ distributes symmetrically around a point $x_0$, median, mode and
mean coincide with $x_0$. Under loose regularity conditions [Mood et al., 1974]
we expect the median to lie closer to the mode than the mean does in a unimodal
distribution law.
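
For the negative exponential distribution of Example 1.9 the three parameters have well-known closed forms: mode 0, median ln(2)/λ and mean 1/λ. The sketch below, whose function name and the value λ = 1 (as in Fig. 1.7) are chosen for illustration, checks the median against the cumulative function and the ordering stated in Fact 1.2.

```python
import math

def exp_cdf(x, lam):
    """Cumulative distribution function of the negative exponential law."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

lam = 1.0                      # parameter value used in Fig. 1.7
mode = 0.0                     # the density lam * exp(-lam * x) is maximal at 0
median = math.log(2) / lam     # solves exp_cdf(x, lam) = 1/2
mean = 1.0 / lam

assert math.isclose(exp_cdf(median, lam), 0.5)
assert mode < median < mean                      # left-to-right order in Fig. 1.7
assert abs(median - mode) < abs(mean - mode)     # median closer to the mode
```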


1.3.3 From single to many bits valued variables 


Counting the number of unwanted messages in your mailbox is a suitable means
for gauging your annoyance with the phenomenon but not for understanding
how to diminish it. Elementary logical tools like "if-then-else" rules require the
joint examination of at least two events in the hope that a causality relation
exists between them. Thus you may try to understand whether unsolicited
messages come more frequently on the weekend or on workdays. Maintaining


Random variables 37 


the code: 1 — unwanted message, 0 — wanted message, we can join a second 
bit to the record qualifying a message being 1 if received on a weekend, 0 
otherwise. To describe our mailbox with this richer vocabulary, we need to 
know the asymptotic ratio 

$$\frac{k_{xy}}{n} \tag{1.49}$$

where $k_{xy}$ counts the number of messages characterized by an interest denoted
by x and a date denoted by y according to the above coding. Suppose you have already
associated similar ratios, namely $k_{1\cdot}/n$ and $k_{\cdot 1}/n$, separately to the
attributes “unwanted” and “sent on weekend”. Can you deduce from them the value of
quantity (1.49) or, vice versa, is this value a true source of new information? The first
case occurs if in any large subset of messages with, for instance, one third of bits equal
to 1 and two thirds equal to 0 with respect to a given attribute, this rate remains the same
whatever value we fix for the other attribute. This means that in the asymptotic
sequence the number of 1’s for a given attribute scales by a factor equal to the
frequency of 1’s in the other when we consider their joint occurrence. Namely

$$k_{11} = k_{1\cdot}\,\frac{k_{\cdot 1}}{n} = k_{\cdot 1}\,\frac{k_{1\cdot}}{n} \tag{1.50}$$
which makes
$$\frac{k_{11}}{n} = \frac{k_{1\cdot}}{n}\,\frac{k_{\cdot 1}}{n} \tag{1.51}$$
Returning to the coded events A = “the message is unwanted”, B =
“the message was sent on a weekend”, we have that (1.51) reads
$$P(A \cap B) = P(A)P(B) \tag{1.52}$$
which qualifies the two events A and B as independent according to Definition
1.6. In any case, we can write the asymptotic frequency $k_{11}/n$ as
$$\frac{k_{11}}{n} = \frac{k_{11}}{k_{\cdot 1}}\,\frac{k_{\cdot 1}}{n} = \frac{k_{11}}{k_{1\cdot}}\,\frac{k_{1\cdot}}{n} \tag{1.53}$$
that we read as
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A) \tag{1.54}$$

implicitly defining the function P(A|B), which we call the conditional probability
of A given B and which is computed by

$$P(A|B) = \begin{cases} \dfrac{P(A \cap B)}{P(B)} & \text{if } P(B) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1.55}$$


It is nice to notice that, according to this definition, the unconditional probability
P(A) reads as $P(A|\Omega)$. Vice versa, taking as sure an event $B \subseteq \Omega$, we
must reduce Ω to B and within this new outcome space redistribute the unitary
probability mass according to (1.55). Note that we are referring to the same set
of observed e-mail messages and we read them with two different vocabularies,
whose intersection gives rise to a more detailed way of examining the messages.




Shifting into the Kolmogorov framework, we may figure an extremely detailed
outcome space, where any elementary event, representing an atom of our
knowledge, is characterized by many attributes. As necessary, we decide to focus
on a small set of them relevant to our operational framework. We move from less
to more detailed atom descriptions by using conditional probabilities according
to (1.55). Vice versa we remove unwanted details by a marginalization operation
that consists in computing unconditional from conditional probabilities.
This happens in the above two-bit example by counting $k_{1\cdot}/n$ as:

$$\frac{k_{1\cdot}}{n} = \frac{k_{11}}{n} + \frac{k_{10}}{n} = \frac{k_{11}}{k_{\cdot 1}}\,\frac{k_{\cdot 1}}{n} + \frac{k_{10}}{k_{\cdot 0}}\,\frac{k_{\cdot 0}}{n} \tag{1.56}$$
which asymptotically reads ³⁶
$$P(A) = P(A|B)P(B) + P(A|B^c)P(B^c) \tag{1.57}$$


This is how knowing, for instance, that the frequency $P(A \cap B)$ of boring messages
sent on the weekend is 0.065 affects your knowledge. Namely, due to (1.55) and
the fact that you receive far fewer messages on weekend days, say 7% of the
entire mailbox (which makes the probability P(B) of a weekend message equal to
0.07), you expect to open unwanted messages on those particular days at
a rate $100\,P(A|B)$ of around 92.86% of the messages received in this period.
To know the overall percentage of unsolicited messages we need the percentage
$P(A|B^c)$ within the workday e-mails in order to fill in (1.57). If this amounts
to 80%, we get from (1.57) that about 80.9% of your inbox consists of junk
messages.
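The arithmetic of this example can be retraced with a minimal sketch of (1.55) and (1.57). The helper names `conditional` and `marginal` are ours, introduced only for illustration; the last line applies Bayes' formula to the same figures as an extra step not carried out in the text.

```python
def conditional(p_joint: float, p_given: float) -> float:
    """P(A|B) as in (1.55): joint probability over P(B), and 0 when P(B) is 0."""
    return p_joint / p_given if p_given > 0 else 0.0

def marginal(p_a_given_b: float, p_b: float, p_a_given_not_b: float) -> float:
    """P(A) as in (1.57), summing over B and its complement."""
    return p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)

p_b = 0.07          # probability of a weekend message
p_a_and_b = 0.065   # unwanted and sent on a weekend
p_a_given_b = conditional(p_a_and_b, p_b)   # about 0.9286
p_a = marginal(p_a_given_b, p_b, 0.80)      # about 0.809

# Bayes' formula: probability that an unwanted message arrived on a weekend.
p_b_given_a = p_a_given_b * p_b / p_a
```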

Passing to random variables, and considering events such as (X = x), (Y = y),
we may build the probability function of the two-bit variable $X = (X_1, X_2)$
introduced at the beginning of this section as follows:

$$P_X(x) = P(X_1 = x_1, X_2 = x_2) = P(X_1 = x_1)P(X_2 = x_2|X_1 = x_1) = P_{X_1}(x_1)P_{X_2|x_1}(x_2) \tag{1.58}$$

After the trivial generalization of (1.55)

$$P(A|B \cap C) = \frac{P(A \cap B \cap C)}{P(B \cap C)} \tag{1.59}$$
and an iterative application of (1.55) itself, we can easily figure the probability
function of any ν-bit-long random variable X as ³⁷:


36 Usual theorems deriving from this definition and from the fact that the Kolmogorov
axioms hold for the conditional probability as well are the following: whenever
$B_i \cap B_j = \emptyset$ for each $i \neq j$ and $\bigcup_{i=1}^{m} B_i = \Omega$, with
$P(B_i) > 0$ for each i,
$$P(A) = \sum_{i=1}^{m} P(A|B_i)P(B_i) \quad \text{(total probabilities theorem)}$$
$$P(B_k|A) = \frac{P(A|B_k)P(B_k)}{\sum_{i=1}^{m} P(A|B_i)P(B_i)}, \text{ when } P(A) > 0 \quad \text{(Bayes’ formula)}$$




$$P_X(x) = P_{X_1}(x_1)\,P_{X_2|x_1}(x_2)\,P_{X_3|x_1,x_2}(x_3)\cdots P_{X_\nu|x_1,\ldots,x_{\nu-1}}(x_\nu) \tag{1.60}$$


While analytically simple and elegant, this expression has the enormous drawback
of requiring the knowledge of $2^\nu - 1$ parameters to be computed for any
x. Namely, we need the parameter $P_{X_1}(1)$ for computing the probability of
the first bit, the two parameters $P_{X_2|1}(1)$ and $P_{X_2|0}(1)$ for the second, the four
parameters $P_{X_3|0,0}(1)$, $P_{X_3|0,1}(1)$, $P_{X_3|1,0}(1)$, $P_{X_3|1,1}(1)$ for the third, up to
$2^{\nu-1}$ for the last bit.
Consequently this kind of specification of $P_X$ is unfeasible in principle. This is
not dramatic, however, because in Nature too it is very rare to meet situations
which require so complex a description. In most cases only a few of the
$2^\nu - 1$ parameters need a specific description, while the remaining ones either
are constant (possibly 0) or have an easy algorithmic description as a function of
the former. For instance all the conditional probabilities equal 1/2 when X follows
a uniform distribution. They equal each other for a same number of conditioning
bits (whatever their values are) in the case of the negative exponential distribution
(see page 343 in Appendix B). In many cases (not in the former!) the relevance
of the conditional probabilities for fixing $F_X$ diminishes with this number. Thus
we are induced to find a more concise analytical representation of the probability
functions in these easy cases (the expressions we already know, for instance,
for the mentioned Bernoulli, binomial and hypergeometric distribution laws)
where only a few necessary parameters are involved, possibly having relevant
meanings. In the rest of this chapter we will sketch some main strategies for
discovering these distributions in relation to some very general hypotheses on
the random variable.
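To make the parameter count concrete, here is a minimal sketch of the chain rule (1.60) for bit strings. The table layout (a dictionary mapping each tuple of preceding bits to the probability that the next bit is 1) and the names `n_parameters`, `joint_probability` and `uniform_table` are assumptions of ours for illustration, not a construction taken from the text.

```python
from itertools import product

def n_parameters(nu: int) -> int:
    """Conditional probabilities needed by (1.60): 1 + 2 + ... + 2**(nu - 1)."""
    return sum(2 ** i for i in range(nu))

def joint_probability(bits, p_one) -> float:
    """Chain rule (1.60); p_one maps a tuple of previous bits to P(next bit = 1)."""
    prob = 1.0
    for i, bit in enumerate(bits):
        p1 = p_one[tuple(bits[:i])]
        prob *= p1 if bit == 1 else 1.0 - p1
    return prob

nu = 3
# Hypothetical table: every conditional probability equals 1/2 (uniform X).
uniform_table = {prefix: 0.5
                 for i in range(nu)
                 for prefix in product((0, 1), repeat=i)}

# The probabilities of all 2**nu bit strings must add up to 1.
total = sum(joint_probability(x, uniform_table)
            for x in product((0, 1), repeat=nu))
```

For ν = 3 the table already holds 7 entries; each further bit doubles the size of the last layer, which is why the full specification quickly becomes impractical.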


1.3.4 Beyond the Bernoulli experiment 


As in a bit map where a picture is reconstructed through a set of pixels with
an intensity (on various bands if the image is colored) possibly thresholded to
0 or 1, we will build most of the probability universe starting precisely from an
infinite sequence of independent Bernoulli experiments ³⁸ (in a number going to
infinity as a counterpart of reducing the pixel grain) coded as a sequence of bits
from a Bernoulli variable X. As in a line of a picture, we capture features
emerging from any subsequence, infinite as well, the most elementary ones being
either the distance between two consecutive 1’s or the sum of the 1’s in a suitable
normalization. In this scenario the sole maneuvering parameters available are a
Bernoulli parameter denoting the density of the phenomenon we are observing,
a length identifying its size in a suitable metering system, and a rate fixing
the resolution with which we observe it. Given the high number n of involved
Bernoulli experiments — henceforth we will call it infinite for short each time
this attribute can be considered by abuse as a “static” property in place of the


37 A similar probability chain has been realized in hardware at King’s College London
[Clarckson et al., 1992]. It is a device, called pRAM, capable of automatically modifying the
conditional probabilities in the chain to agree with (learning, as will be discussed in the next
chapters) the probability distribution of the data it processes.


38 i.e. experiments with independent events as results.




Fig. 1.8: A library shelf as the metaphor of the most random experiments. 


limit of a series — it happens that the frequency of 1’s in the sequence and their
number in a prominent subset of them do not commute: either the frequency
vanishes and the number is finite or the frequency is non-vanishing but the
number is infinite, as we will see in a moment. In order to state a link between
these cases and the physics of unawareness characterizing the experiments we
may consider them as associated to a set of ν (with $\nu \to +\infty$) contiguous
atomic segments (call them atoms) forming a segment of infinite length τ on
the real line. Each experiment maps bijectively on one atom in a completely
random way so that any permutation in the mapping is a possible outcome.
For instance we may sequentially list the experiments on the real line. Or we
may alternate the position of the experiments on the left and right of already
occupied atoms, with a shift on the right when we go on the left tail. The point
is that some atoms yield a reward but we do not know where they
are located. Thus all these mapping strategies are equivalent to just selecting any
atom by chance, independently of past and future selections. In
this framework the probability π that an experiment location falls in a given
prominent yet finite segment, say of unit length to fix ideas, expresses the
density of the emerging phenomenon in terms of the ratio between the (fixed)
number of atoms spanned by the segment and ν, namely $\pi = \lambda/\nu$, a ratio
going to 0 with $\nu \to +\infty$. The outcome of the experiment is exactly a label
$x_i$ equal to 1 if the experiment is located inside the segment, and 0 otherwise.
By linearity, the number of atoms over a segment of length t equals λt. This
length is the phenomenon size parameter: if t is limited, in the sense that it
does not grow with ν, then the number of rewarding atoms is also independent
of ν. Otherwise, the number of rewarding atoms will also tend to +∞.


In the first scenario, we may compare the framework to the job of a librarian
who is searching for the books of a Cybernetics collection without knowing
where it is located on a very long shelf. Assume the books have the same thickness,
say 2 centimeters, and he knows that the entire collection, once stacked, spans
around one meter. Whatever the shelf length, wherever the books are located, he
will find around 50 books to complete the task. This is so if he checks the whole
shelf, a task requiring him to examine ν books because of his total unawareness



Fig. 1.9: Shape of the probability function of a Poisson random variable K for different
values of the parameter λ.


about the library and the exact number of volumes in the collection. Things are
different if he plans to check a smaller number, say n, of books. This number
(rather n/ν, as we will see later) is the third parameter tuning our unawareness
about the phenomenon we are dealing with or, from a complementary
perspective, the resolution with which we are observing it. Namely, picking
at random a still infinite (in our modelization) number n of points (i.e. considering
an infinite subsequence of Bernoulli experiments), with n smaller than
ν however, the number K of points falling in a segment of finite length t is a
random variable ranging from 0 to +∞. It follows a Poisson distribution law,
detailed in Appendix B, as a limit of a binomial distribution of parameters
n and p, with $n \to +\infty$ and $np \to \lambda t$ (Fig. 1.9 illustrates how the shape of the
related probability function depends on λ). In this case indeed the librarian on
the one hand may pick the same book more than once, since we do not exclude
it in the experiment; this is why K may be greater than the number of volumes
in the Cybernetics collection. On the other hand, he will leave some holes in his
examination of the shelf, which may unfortunately contain some of the books
he was looking for. We may deal with this inconvenience in two ways. We may
continue using the probability λ/ν that we multiply by t to extend it to this
segment length (thus having the success probability in a single pick) and by n
to obtain the mean value of the binomial variable that equals np by definition
(see Example 1.7), with p = (λ/ν)t. From the above asymptotic relation, this
implies that λ is to be replaced by λ(n/ν), denoting a reduction of the effective λ
due to the mentioned holes, i.e. to the grain with which we are assaying our
phenomenon. A second way is to remove holes for a moment and define $\tilde\lambda$ as
the analogue of λ in the purged file; then reinsert holes at random (thus introducing
a probability n/ν of having a single wanted book in a non-empty slot), which
restores to λt the meaning of expected value of the emerging binomial variable and
to λ the meaning of the expected number of events falling into a fixed unit segment.
In the case we are now considering, where the number of rewarding atoms is
limited, we do not need any normalization to consider the variable $k = \sum_{i=1}^{n} x_i$.
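The passage from the binomial to the Poisson law can be checked numerically. The sketch below compares the two probability functions for growing n with np held at λt; the value 3.0 for the product λt and the range of k inspected are arbitrary choices of ours.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n Bernoulli experiments of parameter p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability function with the given expected value."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

lam_t = 3.0  # hypothetical value of the product lambda * t

def max_gap(n: int) -> float:
    """Largest pointwise distance between the two laws for k = 0, ..., 19."""
    p = lam_t / n  # keeps n * p equal to lambda * t
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam_t))
               for k in range(20))

gaps = [max_gap(n) for n in (10, 100, 1000, 10000)]
```

The entries of `gaps` shrink as n grows, which is the convergence stated above.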




Fig. 1.10: Shape of a Gaussian d.f. with μ = 6 and σ = 4. Plain and dashed segments
highlight the geometrical meaning of the parameters.


Actually this variable may assume any positive value, as n is infinite. A close 
to infinite value is an accident however, since K specifications spread around a 
finite mean (the true target of the global experiment) with a finite variance. 

Vice versa, if t is comparable to τ, λt goes to infinity, which ultimately descends
from a non-vanishing p. In this case a suitable normalization is obtained by
dividing $\sum_{i=1}^{n} x_i$ directly by n or by any comparable value m. The quantity we
obtain

$$z = \frac{\sum_{i=1}^{n} x_i}{m} \tag{1.61}$$

is a specification of a binomial variable Z scaled by m, with expected value
$E[Z] = \frac{n}{m}p$ and variance $V[Z] = \frac{n}{m^2}p(1 - p)$, both ranging between 0 and +∞ for a
proper choice of m and p as functions of n. In addition, after a linear shift E[Z]
ranges between −∞ and +∞ (without affecting the variance, see (B.15)). The
density function of the continuous variable asymptotically approximating Z is
described by a Gaussian law (see Appendix B). Its shape follows from the fact
that, according to our model, the sum $Z_{12}$ of two random variables $Z_1$ and $Z_2$
having this density function with expected value and variance $\mu_1, \sigma_1^2$ and $\mu_2, \sigma_2^2$,
respectively, follows the same distribution law with parameters $\mu_1 + \mu_2$ and
$\sigma_1^2 + \sigma_2^2$. The density function with this property has the precise shape of a Gaussian
density function. ³⁹ Determining this shape is a somewhat longer analytical job
based on some additional symmetry properties (see Example B.22); realizing
the above reproducibility property is easier. We will do so in Sec. 1.3.7. Figure
1.10 shows the typical shape of a Gaussian probability density function and its
relations with the parameters μ and σ, where the latter has the meaning of a
standard deviation according to Definition B.9.

Consider the number T of bits set to 0 between two consecutive 1’s in the n 
long sequence and visualize it as the distance between people on a street. If T is 


39 Rather, the only random variables $X_1, X_2, \ldots$ whose sum exactly follows a Gaussian
distribution law are Gaussian random variables themselves. See [Wilks, 1962] for a technical
proof.




[Figure 1.11: panels (a) λ = 4, (b) λ = 1, (c) λ = 5]


Fig. 1.11: Shape of the probability density of a negative exponential random variable
T for different values of the parameter λ.


large you may say that meeting a person is a rare event; if the distance is close to
0 you say that the street, and by extension the events, are crowded. In the first
case T is a number ranging from 0 to +∞ where the event (T > r) is equivalent
to the event: (a sequence of r Bernoulli experiments were unsuccessful, no
matter what happened later). Its probability is
$P(T > r) = (1 - p')^r = \left(1 - \frac{np'}{n}\right)^{n(r/n)}$, which reads $e^{-\lambda t}$ when
$n \to +\infty$ for a proper normalization of T ⁴⁰. Namely, r/n is replaced by t, and
np′ = λ. Thus: i) you may
reread r for instance either in hours, or in minutes, or in seconds, consequently
increasing the accuracy with which you scan the time. Correspondingly, n
represents the number of Bernoulli experiments fitting the time unit you decided
(say the number of experiments per hour, or minute, or second). In particular,
ii) for p’ coming back to represent the initial density parameter 7 = d/ v, ie. 
the expected number of successes per time unit and experiment, n represents 
the scale factor between the time unit to which À refer and the time unit to 
which À refers — hence A reads again ÀZ with E[T] = + as we will see later on 
(Appendix B, page 343). 

Since we are referring to a vanishing λ/ν, n may grow to infinity in order
to use the above exponential asymptote. Moreover, since n increases without
changing the product λt we can approximate T with a continuous variable. For
finite λ this variable describes the interval, in a space of any dimension, between
two events of interest that are completely out of our control; this variable is
described by a negative exponential distribution law (see Appendix B), whose
density function is illustrated in Fig. 1.11 for different values of λ. By the sole
fact that this distance may assume any value we have that the events are rare.
On the contrary, for $\lambda \to +\infty$, $P(T > r) \to 0$ for any r and any scale of T, which
denotes crowded events ⁴¹.
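The exponential asymptote can be verified directly. In the sketch below the probability of r = nt consecutive failures, each with success probability p′ = λ/n, is compared with $e^{-\lambda t}$ for increasing n; the values λ = 2 and t = 0.75 are arbitrary choices of ours.

```python
import math

def survival_discrete(lam: float, t: float, n: int) -> float:
    """P(T > r) = (1 - p')**r with p' = lam / n and r = n * t experiments."""
    return (1.0 - lam / n) ** (n * t)

lam, t = 2.0, 0.75              # hypothetical rate and normalized distance
limit = math.exp(-lam * t)      # survival function of the negative exponential law

errors = [abs(survival_discrete(lam, t, n) - limit)
          for n in (10, 100, 1000, 10 ** 6)]
```

Refining the time scale (larger n) with the product λt unchanged brings the discrete probability as close as desired to the continuous one.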

These two frameworks, Poisson and Gaussian, i.e. rare and crowded events, 


40 as $(1 + a/m)^m \to e^a$ with m going to infinity for finite a.
41 It is the same kind of events we considered above with λ/ν still vanishing, yet a
non-vanishing resolution n/ν and a phenomenon size t going to infinity.




are at the basis of most random variables of general use. Further variables come
from an enrichment of the frameworks by introducing other free parameters. We
will not deal with those in this book. Rather, we will consider extensions
coming from either aggregates of random variables or functions of them.


1.3.5 Aggregates of random variables 


Encouraged by the above results we may decide to associate any mailbox message
with a long record reporting information about the sending date, the indices
in Table 1.6, the topic treated in the e-mail, etc. In search of a suitable intermediate
strategy between dealing with this record as a unique variable (for which
an order relation makes no sense) and as a set of Bernoulli variables (whose drawback,
as mentioned, consists in the large number of conditional probabilities to be considered
for their handling) we may decide to group them and eventually recode
the groups in order to obtain a limited number of true metric variables. They
are generally framed into a vector. But this is an overstructure, since generally
there is no order relation between its components. What is important is
that a relation generally exists between them. It is not a sharp dependence
like “if the message was received on the weekend then it is boring”, since we
know for instance that only 92.86% of weekend messages are so; thus we speak
of a stochastic dependence. It essentially says that there is a hidden variable
discriminating between weekend messages obeying the above rule and messages
which do not, where the former are 92.86% of the total. This calls for
an algorithmic interpretation of the stochastic dependence, as the effect of an
algorithm connecting events that are known only in part; grasping pieces of the
unknown part is a typical inference target we will discuss later. Referring to
metric variables we can recover this interpretation within the following scheme.

First, speaking of independence between random variables $X_1$ and $X_2$ we
focus on the joint events $\{(X_1 = x_1 \cap X_2 = x_2)\}$ for any pair of specifications
of the variables, and formalize the following definition.


Definition 1.23. Two discrete random variables $X_1$ and $X_2$ are independent if
$P(X_1 = x_1 \cap X_2 = x_2) = P_{X_1}(x_1)P_{X_2}(x_2)$, $\forall (x_1, x_2) \in D_{X_1} \times D_{X_2}$, with obvious
extensions to the cases of:


• either both continuous $X_1$ and $X_2$ or discrete $X_1$ and continuous $X_2$ (see
Section B.4.2),


• sets of more than two random variables (see Definition B.15).


Remark 1.7. With this notation the above requirements on the full expansion of 
the observations’ symmetry ensembles may be exactly translated into indepen- 
dence requirements between the variables {X1,..., Xm} constituting a sample 
of X, as will be definitely stated in Definition 2.1 in the next chapter. 




Fig. 1.12: Possible Venn diagrams for the events A = “the message is unsolicited” and 
B = “the message is long”. (a) Non disjoint events. (b) Disjoint events. 


Now, imagine we want to understand the stochastic relation between the
boredom degree D of a message and its length L. In line with what
is discussed above, we can figure D and L as the sum of n pairs of
Bernoulli variables jointly taken in a long sequence of experiments. Namely
we can figure two elementary events of the outcome space Ω, the seeds
$a_i$ = “a boredom atom (actually the i-th one) occurs in the message” ⁴² and
$b_j$ = “a unitary term (actually the j-th one) adds to its length”. With proper
ranges of i and j (which have nothing to do with n), they group respectively
in the events A and B, represented through Venn diagrams ⁴³ like those in
Fig. 1.12.

Analogously to Sec. 1.3.4, a single experiment consists in uniformly drawing
one among the ν elementary events constituting Ω. The event may belong
to either A or B or neither of them. Let n be the length of the joint experiments
sequence and denote by $N_D$ and $N_L$ the number of successes with respect to the
events recalled in the indices; at each experiment we can increase by one unit
either $N_D$ or $N_L$ or both. In case A and B are disjoint (Fig. 1.12(b)) we have
that on each experiment only one N may increase; moreover, if n is large and
P(B) is much smaller than 1 (i.e. such that $n - y \simeq n$ and $P(\Omega - B)^n \simeq 1$), we
have that

$$P(N_D = x | N_L = y) = \frac{P(x \text{ successes out of } (n - y) \text{ Bernoulli experiments of parameter } P(A))}{P(\Omega - B)^{n-y}} \simeq \binom{n}{x}P(A)^x(1 - P(A))^{n-x} = P(N_D = x) \tag{1.62}$$


having divided both numerator and denominator of the intermediate term by 


42 Once again we are looking to identify the atoms of our emerging knowledge.
43 where events are abstractly associated to domains in a plane, typically ellipses or convex
domains in general, highlighting relevant set properties such as contiguity, inclusion, etc.




P(y successes out of y Bernoulli experiments of parameter P(B)); and analogously
for $N_L$. Thus, in spite of the fact that both A and B fill up the same
stack of limited capacity n, under the above hypotheses $N_D$ and $N_L$ behave
as independent random variables. In other words, the distances (in number of
elementary experiments) between two successes are generally so large that the
two events do not influence each other. If n is small and P(A), P(B) are large,
each success of A takes room away from B’s successes, thus $P(N_L > y | N_D = x)$ is
always less than $P(N_L > y)$ for each x > 0 and decreases as x increases. In
the case where A and B overlap (Fig. 1.12(a)) we have the balance of two effects. If a
success occurs in the intersection of A with B it represents a success of A that
determines a success of B as well. If a success occurs in the region $A - A \cap B$ then
we have the same effect as for disjoint sets. The balancing of the two actions
depends on the degree of overlapping of the two atomic events and determines
the sign and the amount of dependence between the random variables D and
L.

Wanting to consider how a set of random variables behaves jointly (i.e. with
which frequency it takes values) we must glue the distribution laws of the single
variables through a measure of their mutual stochastic dependence. In Appendix
B we report the distribution laws of some special joint random variables that
do not derive expressly from (1.58). Joint variables vary in the product space,
i.e. in $\mathbb{R}^\nu$ if ν is the cardinality of the variables’ set. Therefore we need an
extension of the standard functions describing distribution laws ($P_X$, $F_X$ and
$f_X$) to realize mappings from the product space to ℝ (see Appendix B).


1.3.6 Functions of random variables and their generators 


Since (X = x) is an event when X is a random variable, we can define new
random variables on these events. Thus we may obtain a random variable
Y = g(X) with g computing a special property of X we are interested in.
There is no constraint on the shape of g ⁴⁵. Y is defined on the whole real line
ℝ, like X. Being a single-valued mapping, g automatically transfers to a given y
the sum of the probabilities of each x such that g(x) = y, which defines the
probability function $P_Y(y)$. We must devote some attention to managing this
transfer when dealing with continuous random variables, as shown in Appendix
B.

This change of variables may be seen as an abstraction process. Starting with
X and its probabilities, we focus on a particular property of X captured by Y.
For this new variable we can compute the probability function $P_Y$ and then
forget completely about X. Actually our classification of incoming messages as
“boring” or “non boring” that we associate to a Bernoulli variable X generally
comes from a thresholding of another random variable Z quantifying the interest

44 Note that the fact that ($N_D$ = x) and ($N_L$ = y) are disjoint events implies that they
are not independent (unless one of the two events has probability 0). Here we are linking the
independence of D and L to disjointness of the underlying atomic events A and B.

45 except for its computability as mentioned before.




Fig. 1.13: Inverting the c.d.f. of a distribution in order to generate its specifications
from those of U. (a) Continuous distribution: Gaussian with μ = 1 and σ = 0.4. (b)
Discrete distribution: $D_X$ = {1, 2, 3, 4} and p = {0.1, 0.3, 0.5, 0.1}.


degree of the message. Namely for a suitable threshold θ

$$X = \begin{cases} 1 & \text{if } Z < \theta, \\ 0 & \text{otherwise.} \end{cases} \tag{1.63}$$

Thus the parameter p of X is $p = F_Z(\theta)$. As you may notice, however, we had
no need of knowing θ for all discussions about properties of X. Coming back
from Y to X is not always possible, since to recover $P_X$ it is necessary for g to
be bijective.

In the following we will manage many functions of random variables, possibly
having more variables in input. For instance we already considered in (1.29) a
specification $x_{\mathrm{tot}}$ of a variable $X_{\mathrm{tot}}$ coming from n specifications of X to take
an estimate of $\mu_X$. Actually, $X_{\mathrm{tot}} = \sum_{i=1}^{n} X_i$ takes random values as a function
of the random values taken by the set of n variables $X_i$. In some cases, like the
latter, it may occur that the distribution law of the new variable is less complex
than the originals’. We have already defined the operator E[g(X)]. It is nice to
check that, as Y = g(X), we have:

$$E[Y] = \sum_{y \in D_Y} y P_Y(y) = \sum_{x \in D_X} g(x)P_X(x) = E[g(X)] \tag{1.64}$$

where $D_Y$ is the image of $D_X$ through g, i.e. the set of all y’s such that an
$x \in D_X$ exists for which g(x) = y.
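The transfer of probability from X to Y = g(X) and the identity (1.64) can be checked on a toy discrete variable. In the sketch below the names `pushforward` and `expectation`, the choice of X uniform on {−2, …, 2} and the map g(x) = x² are ours, picked so that g is not bijective.

```python
from collections import defaultdict

def pushforward(p_x: dict, g) -> dict:
    """Probability function of Y = g(X): each y collects the mass of all x with g(x) = y."""
    p_y = defaultdict(float)
    for x, prob in p_x.items():
        p_y[g(x)] += prob
    return dict(p_y)

def expectation(p: dict, g=lambda v: v) -> float:
    """Sum of g(v) * P(v) over the support of a discrete variable."""
    return sum(g(v) * prob for v, prob in p.items())

# Hypothetical discrete X, uniform on {-2, ..., 2}.
p_x = {x: 0.2 for x in range(-2, 3)}
g = lambda x: x * x

p_y = pushforward(p_x, g)
lhs = expectation(p_y)      # sum over y of y * P_Y(y)
rhs = expectation(p_x, g)   # sum over x of g(x) * P_X(x)
```

Here Y takes only the values 0, 1 and 4, so $P_X$ cannot be recovered from $P_Y$, yet the two sides of (1.64) agree.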

A special function we will use in the following is $Y = F_X(X)$. Its peculiarity
is that, for whatever continuous X, $F_Y(y) = y$ for $y \in [0, 1]$, i.e. Y follows a
uniform distribution law over the continuous set [0, 1]. Indeed ⁴⁶

$$F_Y(y) = P(F_X(X) \leq y) = P(X \leq F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y \tag{1.65}$$


46 This statement is commonly known as the probability integral transformation theorem 
[Rohatgi, 1976]. 




Fig. 1.14: Producing a negative exponential variable X with λ = 2.5. Gray plot:
empirical c.d.f. from 1,000 specifications of X obtained through the instruction (1.69).
Black plot: analytical c.d.f. of X (almost completely overlapping the gray curve).


where by $g^{-1}$ we denote the inverse function of g, i.e. a function such that
$g^{-1}(g(x)) = x$ for each $x \in D_X$ ⁴⁷. Thus, given an experiment producing
specifications $\{u_i\}$ of a [0, 1]-uniform random variable U, we obtain an experiment
producing specifications $\{x_i\}$ of a continuous random variable X via the simple
routine (see Fig. 1.13(a)):

$$x_i = F_X^{-1}(u_i) \tag{1.66}$$

Indeed, denoting by Z the variable $F_X^{-1}(U)$, whose specifications we are producing,
we have:
$$F_Z(z) = P(F_X^{-1}(U) \leq z) = P(U \leq F_X(z)) = F_X(z) \tag{1.67}$$


Example 1.10. A random variable X distributed according to a negative exponential
law of parameter $\lambda \in \mathbb{R}^+$ has c.d.f. $F_X(x) = (1 - e^{-\lambda x})\, I_{[0,+\infty)}(x)$.
Therefore

$$F_X^{-1}(u) = -\frac{1}{\lambda}\ln(1 - u)\, I_{(0,1)}(u) \tag{1.68}$$

Since 1 − U has the same distribution as U, the following instruction generates
a specification x of X having in input a specification u of U:

$$x = -\frac{1}{\lambda}\ln u \tag{1.69}$$


Figure 1.14 shows a comparison between the empirical c.d.f. of a set of
1,000 values returned by this algorithm and the negative exponential distribution
c.d.f., for λ = 2.5. To obtain these values we applied (1.69) to a set of
specifications of 1,000 independent random variables $U_i$ generated by another
algorithm that we discuss in a moment.
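An experiment in the spirit of Fig. 1.14 can be sketched as follows. We assume Python's standard `random` module as the source of uniform specifications, which is a stand-in of ours and not the generator discussed by the text; the seed and the grid of points where the two c.d.f.s are compared are arbitrary.

```python
import math
import random

def sample_exponential(lam: float, u: float) -> float:
    """Instruction (1.69): map a uniform specification u in (0, 1] to x = -ln(u) / lam."""
    return -math.log(u) / lam

rng = random.Random(0)   # seed chosen arbitrarily, for reproducibility only
lam = 2.5
# 1.0 - rng.random() lies in (0, 1], so the logarithm is always defined.
xs = [sample_exponential(lam, 1.0 - rng.random()) for _ in range(1000)]

def empirical_cdf(sample, x: float) -> float:
    """Fraction of sample items not greater than x."""
    return sum(v <= x for v in sample) / len(sample)

# Largest distance between empirical and analytical c.d.f. on a grid of points.
max_gap = max(abs(empirical_cdf(xs, x) - (1.0 - math.exp(-lam * x)))
              for x in (0.1 * i for i in range(1, 31)))
```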


An analogous transform works in the case of a discrete random variable K.


47Such a function always exists for g = Fx thanks to the monotonicity and continuity of 
Fx, with some caveat in the possible X intervals where Fx is constant. 




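Namely, divide the unit interval $\mathcal{U}$ into segments $T_i$ of length $p_i = P_K(x_i)$ ⁴⁸
for an indexing i of the K specifications (see Fig. 1.13(b)). Now the routine is

$$k = x_j \quad \text{if } u_i \in T_j \tag{1.70}$$
Indeed, the rule reads
$$k = x_j \quad \text{if } \sum_{r=0}^{j-1} p_r < u \leq \sum_{r=0}^{j} p_r \tag{1.71}$$

with $p_0 = 0$. Consequently

$$P(K = x_j) = P\left(\sum_{r=0}^{j-1} p_r < U \leq \sum_{r=0}^{j} p_r\right) = \sum_{r=0}^{j} p_r - \sum_{r=0}^{j-1} p_r = p_j \tag{1.72}$$

for each j such that $x_j$ is a specification of K.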
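The rule (1.71) amounts to a search among cumulative sums. The sketch below implements it with a binary search on the discrete distribution of Fig. 1.13(b); the factory name `make_discrete_sampler`, the use of `bisect`, and Python's `random` module as uniform source are choices of ours, not prescribed by the text.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def make_discrete_sampler(values, probs):
    """Build the routine (1.70)-(1.71): return x_j when u falls in the j-th segment."""
    bounds = list(accumulate(probs))   # right ends of the segments T_j
    def sample(u: float):
        j = bisect_left(bounds, u)     # first j with u <= p_1 + ... + p_j
        # Guard against rounding error in the last cumulative sum.
        return values[min(j, len(values) - 1)]
    return sample

# Distribution of Fig. 1.13(b).
sample = make_discrete_sampler([1, 2, 3, 4], [0.1, 0.3, 0.5, 0.1])

rng = random.Random(1)   # arbitrary seed
draws = [sample(rng.random()) for _ in range(10000)]
freq = {v: draws.count(v) / len(draws) for v in (1, 2, 3, 4)}
```

The observed frequencies in `freq` should approach the assigned probabilities as the number of draws grows, in agreement with (1.72).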


Example 1.11. The geometric distribution of parameter $p \in (0, 1]$ has probability
function $P_X(x) = (1 - p)^x p\, I_{\mathbb{N}}(x)$ (see page 335 in Appendix B). Its sampling
through the direct application of (1.71) would require the storage of an infinite
number of delimiters of the $T_j$ segments. However, since
$$\sum_{j=0}^{i} P_X(j) = \sum_{j=0}^{i} p\,q^j = 1 - q^{i+1} \tag{1.73}$$

with q = 1 − p, we might implement (1.71) by assigning to X the smallest
integer i such that $q^{i+1} < 1 - U$. This leads to the following instruction for
generating a specification x of X having in input a specification u of U:

$$x = \left\lfloor \frac{\ln u}{\ln q} \right\rfloor \tag{1.74}$$

where ⌊a⌋ denotes the floor of a, i.e. the highest integer less than or equal to a, and u
replaces 1 − u as in the previous example.

Figure 1.15 shows a comparison between the empirical c.d.f. of a set of 
1,000 values returned by this algorithm and the geometric distribution c.d.f., 
for p = 0.4. 
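A minimal Python sketch of instruction (1.74) follows, assuming 0 < p < 1 so that ln q is defined and nonzero. The name `sample_geometric` and the fixed seed are ours; the loop compares the empirical c.d.f. with the closed form in (1.73).

```python
import math
import random

def sample_geometric(u, p):
    """Instruction (1.74): floor(ln u / ln q) with q = 1 - p, for u in (0, 1], 0 < p < 1."""
    return math.floor(math.log(u) / math.log(1.0 - p))

rng = random.Random(2)
p = 0.4                                     # the value used for Figure 1.15
xs = [sample_geometric(1.0 - rng.random(), p) for _ in range(1000)]

for i in range(4):
    empirical = sum(x <= i for x in xs) / len(xs)
    exact = 1.0 - (1.0 - p) ** (i + 1)      # c.d.f. from (1.73)
    print(f"i={i}  empirical={empirical:.3f}  exact={exact:.3f}")
```

No delimiter of the $\tau_j$ segments is stored: each value costs two logarithms and one floor.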


Remark 1.8. The above examples introduce random variables in a truly natural
way as a tool for synthesizing, for instance via empirical c.d.f., observed
samples. In turn, samples are thought of as the output of a general algorithm
producing their elements from companion elements of a uniform random variable
through either mapping (1.66) or (1.70). Denoting this mapping $g_\theta$, for a
suitable parametrization, and speaking of random variable independent
specifications to refer to specifications of independent random variables having the
same distribution law, the procedure is sketched in Algorithm 1.1.


48 Note that while this operation may be complex, it is feasible even if K assumes a countably
infinite set $\{x_i\}$ of values with probability greater than 0.




Fig. 1.15: Producing a geometric variable X with p = 0.4. Gray plot: empirical c.d.f. 
from 1,000 specifications of X obtained through the instruction (1.74). Black plot: 
analytical c.d.f. of X (almost completely hiding the former one). 


Algorithm 1.1 Generating m items of X random sample


1. set $g_\theta$ as in (1.66) or (1.70)
2. for i = 1 to m
     draw an independent specification $u_i$ from a [0,1] uniform variable
     $x_i = g_\theta(u_i)$
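Algorithm 1.1 translates almost literally into code once the mapping $g_\theta$ is passed as a function. The sketch below is illustrative only: `generate_sample`, `exponential_mapping` and `bernoulli_mapping` are hypothetical names, and the seeds are drawn with Python's standard `random` module.

```python
import math
import random

def generate_sample(g_theta, m, rng):
    """Algorithm 1.1: feed m independent uniform seeds to the mapping g_theta."""
    sample = []
    for _ in range(m):
        u = 1.0 - rng.random()      # seed in (0, 1], so that log(u) is defined
        sample.append(g_theta(u))
    return sample

def exponential_mapping(lam):
    """Mapping (1.69) with the parameter lam frozen."""
    return lambda u: -math.log(u) / lam

def bernoulli_mapping(p):
    """Rule (1.70) for a two-valued variable: 1 on a segment of length p, else 0."""
    return lambda u: 1 if u <= p else 0

x_exp = generate_sample(exponential_mapping(2.5), 5000, random.Random(3))
x_ber = generate_sample(bernoulli_mapping(0.3), 5000, random.Random(3))
print(sum(x_exp) / len(x_exp))    # should be close to 1 / 2.5
print(sum(x_ber) / len(x_ber))    # should be close to 0.3
```

Both samples are built from the same seeds; only the mapping differs, which is the point of separating the uniform source from $g_\theta$.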


Managing functions of random variables, we face an abstraction chain
that considers properties of a random variable in turn expressing properties of
another random variable, i.e. that recursively composes algorithms on a given
random input. Its starting point is still represented by U. This is a variable that
does not express any property per se except for the crucial fact that the variable
is limited, so we can check that the whole unitary mass has been spanned on it.
Therefore, as will be shown below, we are left again with the original problem
of being unable to control the sequence order in a set of data. Now, being
at the starting term of the abstraction chain, we cannot rely on other random
variables, only on an algorithm with fixed (possibly null) input. Let us consider
the two constituting properties of a generator of a sequence of independent $U_i$
specifications, for any discretization grain δ, i.e. with whatever number p of bits
the $u_i$'s are described:


• the asymptotic frequency of the generated values in the sequence is the
same;


• the asymptotic frequency of any sequence of values equals the product of
the analogous frequencies of the single values.


The former is met by any periodical production of the specifications. The
second condition translates into the fact that our algorithm Y generates $u_i$'s in an
unpredictable way. That is, no algorithm exists in our computational context
C that, on the basis of the log of the previously produced $u_i$'s, predicts the next
specification with a probability different from the one of an equiprobable
experiment over all possible outputs. This means that the probability that the outputs
of the two algorithms coincide is $1/2^p$. We may design Y relying, as in a roulette,
on the joint intervention of numerous variables such as croupier strength, wheel
friction, local winds, etc. (whose values at each run we do not know), which
destroy any regularity in the number production. Otherwise we can rely on
the complexity of inverting Y. Typically, we may use a Y which quickly computes a
specification sequence depending on a few initial parameters, yet proves
difficult for another algorithm to identify from the produced sequence.
In sum, no computation on the $u_i$'s senses, and therefore may be affected by,
the relations between the $u_i$'s produced by Y. The inversion difficulty may be
modulated according to the computational resources available. For instance, the
class of one way functions [Diffie and Hellman, 1976] is constituted by functions
requiring in the worst case a running time almost exponential in the number
p of bits describing the $u_i$'s to be inverted ⁴⁹. Thus, in the current technology
status where this amount of running time is unaffordable by any computer for
p > 100, a sequence of $u_i$'s that long meets the second randomness condition and
we can say that a one way function is a random number generator ⁵⁰. As a
matter of fact the usual functions involved in everyday computation are never
affected by the relations hidden in $u_i$ sequences generated recursively through
the following function (see Algorithm B.1 in Appendix B):


$u_i = a\, u_{i-1} \bmod n$ (1.75)


where b mod c denotes the remainder of the integer division of b by c, and
$u_0, a, n \in \mathbb{N}$ are suitably fixed by the user ⁵¹ with a and n being relatively prime and as big
as the available memory allows. This routine is included in the library of many
operating systems and programming languages. As the routine is completely
deterministic, to produce different sequences we need to use a different starting
value each time. A good choice for it is the number of seconds from a fixed date
till the current time, wrapped in the interval [0, n]. To follow an old notation and
also warn the user that its randomness could be broken by a skillful algorithm,
we are used to saying that we simulate a sequence of random numbers when
we generate it by a computer. Actually nobody can guarantee that Nature
too simulates the random sequences it supplies us with. This is a hypothesis flavoring
the entire matter of this book yet affecting none of the results and operational
directions here reported.
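A multiplicative congruential recursion of this kind can be sketched as follows. Everything specific here is an assumption of ours and not taken from Algorithm B.1: the constants a = 16807 and n = 2^31 − 1 are one classical choice, the division by n to land in the unitary interval is our reading of how the integers become uniform specifications, and the names `lcg_states` and `lcg_uniform` are hypothetical.

```python
import time
from itertools import islice

def lcg_states(seed, a=16807, n=2**31 - 1):
    """Yield the integers u_i = a * u_{i-1} mod n of the recursion."""
    u = seed % n
    if u == 0:
        u = 1                       # 0 would stay 0 forever under this recursion
    while True:
        u = (a * u) % n
        yield u

def lcg_uniform(seed, a=16807, n=2**31 - 1):
    """Rescale each state by n to obtain values in the open interval (0, 1)."""
    for u in lcg_states(seed, a, n):
        yield u / n

seed = int(time.time())             # seconds elapsed since a fixed date
print(list(islice(lcg_uniform(seed), 5)))
```

Running the generator twice with the same seed reproduces the same sequence, which is why the starting value has to change from run to run.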


49 More precisely, greater than any polynomial; see the reference suggested in the last section
for a more precise definition.

50 Classical probability theory disregards the computational constraints and qualifies the
output of such a function as pseudo-random numbers.

51 See [Ripley, 1987] for possible bad values of these constants.




1.3.7 Limit theorems 


Consider a pair of replicas of the segment of infinite length τ made up of an
infinite number ν of atomic segments as discussed in Sec. 1.3.4. Assume t
comparable with τ so that the random variable $Z_k = \frac{1}{n} \sum_{i=1}^{n} X_{i,k}$ is a Gaussian
variable (for k = 1, 2), with n the number of experiments drawn on each segment,
and $X_{i,k} = 1$ if the i-th experiment drawn on the k-th replica maps within a
fixed t long part of τ. Now mix the two sets of experiments, i.e. you draw
2n experiments each mapping in any atom from among the 2ν atoms filling
the joined segment of length 2τ. What distinguishes the first case from the
second is that in the former you have exactly n experiments mapping in the
first segment and n in the second. In the latter case, after counting how many
of the 2n experiments mapped in the first segment and how many in the second, you
may expect that the numbers you find are not equal. Indeed the number $N_1$ of
experiments falling in the first segment is a binomial variable of parameters 1/2
and 2n. But since we are interested in the variable rescaled by n, the probability
that $N_1$ moves far from its mean n by more than any fixed percentage (i.e.
$N_1/(2n)$ far from 0.5 by more than an $\epsilon \in [0,1]$) goes to 0 as n grows. Thus the sum
of the two Gaussian variables $Z_1$ and $Z_2$ referring to the two single segments
is still a Gaussian variable $Z_{12}$, where, according to the mentioned section,
$E[Z_{12}] = 2np/n$, i.e. the sum of the means, and $V[Z_{12}] = 2np(1 - p)/n^2$, i.e.
the sum of the variances. The same arguments hold if


• the numbers of Bernoulli experiments involved in the two segments are
different, say $n_1$ and $n_2$, since this proportion is asymptotically maintained
also in the mixed sample;


• the scale and probability parameters n and p are different in the two
segments, still in force of the asymptotic persistence of the above
proportionality;


• the segments are more than two.

This leads to the well known property:


Fact 1.3. The sum of independent Gaussian random variables follows a Gaus- 
sian distribution having as mean and variance the sum of the original variables’ 
means and variances, respectively. 
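Fact 1.3 can be checked numerically on a simulated sample. The sketch below is only a sanity check under assumptions of ours: the two pairs of parameters are arbitrary, and the Gaussian specifications come from `random.gauss` in Python's standard library. It compares the first two moments of the sum, plus one point of its c.d.f., with those of the Gaussian law the fact predicts.

```python
import math
import random
import statistics

rng = random.Random(4)
mu1, sigma1 = 1.0, 2.0      # hypothetical parameters of the first variable
mu2, sigma2 = -3.0, 1.5     # hypothetical parameters of the second variable

sums = [rng.gauss(mu1, sigma1) + rng.gauss(mu2, sigma2) for _ in range(20_000)]

# Gaussian law predicted by Fact 1.3: means and variances add up
target = statistics.NormalDist(mu1 + mu2, math.sqrt(sigma1**2 + sigma2**2))

print(statistics.fmean(sums), target.mean)
print(statistics.pvariance(sums), target.variance)
t = -2.0
print(sum(s <= t for s in sums) / len(sums), target.cdf(t))
```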


We call reproducibility the property of maintaining over the sum the shape 
of the distribution law. Note that the reproducibility of the binomial variable 
for p fixed holds for whatever involved n;’s (not necessarily large) simply by 
definition, namely: 


Fact 1.4. The sum of independent binomial random variables having the same 
parameter p € [0,1] follows a binomial distribution having as integer parameter 
the sum of the original variables’ integer parameters. 
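For small integer parameters Fact 1.4 can be verified exactly, with no sampling involved, by convolving the two probability functions. The helper names `binomial_pmf` and `convolve` and the values of $n_1$, $n_2$ and p below are our own illustrative choices.

```python
from math import comb, isclose

def binomial_pmf(n, p):
    """Probability function of a binomial variable of parameters n and p."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(f, g):
    """Probability function of the sum of two independent integer-valued variables."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

n1, n2, p = 3, 5, 0.3
pmf_sum = convolve(binomial_pmf(n1, p), binomial_pmf(n2, p))
pmf_direct = binomial_pmf(n1 + n2, p)
print(all(isclose(a, b, abs_tol=1e-12) for a, b in zip(pmf_sum, pmf_direct)))
```

The same check fails as soon as the two variables are given different values of p, which is why the fact requires a common parameter.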


The same holds for other random variables, such as Poisson, Gamma etc. 
(see fact B.6), still defined as a sum of elementary variables per se. 




Now consider any random variable X in its discrete form and the generation
mechanism described in Sec. 1.3.6. Thus consider the related partition of the
unitary segment U where a uniform variable specification $u_i$ determines the
value $x_i$ depending on the subset τ of U it hits. Finally, consider a set of U
specifications. Focusing on a single $\tau_k$, you may read this scenario in terms of
the infinite sequence of n Bernoulli variables at the basis of the Gaussian model.
Focusing on a $\tau_h$, you may do the same, while simultaneously considering both $\tau_k$
and $\tau_h$ requires the use of the same n experiments for the generation of the two
Gaussian variables $Z_k$ and $Z_h$, which induces a stochastic dependence between
the variables. According to Sec. 1.3.5 however this dependence vanishes since
we draw the $u_i$'s from a very large set, indeed of cardinality n going to infinity,
mapping into very fine grain τ's (some of them possibly addressing the same X
specification). Thus, we compute $\sum_{i=1}^{n} x_i$ by grouping the addends by the same
values of corresponding τ's, namely $\sum_j r_j\, x(\tau_j)$, where $r_j$ is the number of U
specifications falling on $\tau_j$ and $x(\tau_j)$ the corresponding value assumed by X. This
reads as a linear combination of sums of independent Bernoulli variables. Hence
we can use Fact 1.3 coming to the conclusion:


Fact 1.5. The normalized sum of an infinite number of independent random 
variables having the same distribution law converges to a Gaussian variable. 


A more rigorous enunciation of this fact taking into account some caveats is 
the following 


Theorem 1.4 (Central Limit Theorem). Denoting with S the sum of n
independent random variables having a same distribution law with finite mean μ and
variance σ², we have:

$\lim_{n \to \infty} F_{\frac{S - n\mu}{\sigma\sqrt{n}}}(t) = F_Z(t)$ (1.76)


where Z is a Gaussian variable with E[Z] = 0 and V[Z] = 1.
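The convergence stated by the theorem can be observed on simulated data. In the sketch below the addends are uniform on the unitary interval (so μ = 1/2 and σ² = 1/12); the number of addends, the number of replicas and the name `normalized_sum` are arbitrary choices of ours.

```python
import math
import random
import statistics

def normalized_sum(n, rng):
    """(S - n*mu) / (sigma * sqrt(n)) for S the sum of n uniform [0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(5)
zs = [normalized_sum(30, rng) for _ in range(5000)]
std_normal = statistics.NormalDist()          # Z with mean 0 and variance 1

for t in (-1.0, 0.0, 1.0):
    empirical = sum(z <= t for z in zs) / len(zs)
    print(f"t={t:+.1f}  empirical={empirical:.3f}  F_Z(t)={std_normal.cdf(t):.3f}")
```

Already with 30 addends the empirical c.d.f. of the normalized sum should sit within a couple of hundredths of the Gaussian one at the points tested.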


We can however relax this form in many ways, taking into account that the
above arguments hold also for some infinite sequence of unitary segments
partitioned not exactly in the same way. Thus the statement is robust against some
differences between the addend distribution laws, a moderate mutual
dependency, and so on.

Summing up, we started the whole discussion in this chapter from an infinite
sequence of observations $(x_1, \ldots, x_n)$ with n going to infinity. Then we read it
as a sequence of uniform values $(u_1, \ldots, u_n)$ hitting partitions of the unitary
segment and discovered that the sum S of the $x_i$'s may be described equivalently
by the sum of infinite successes encountered in an infinitely large subset of a set
of ν atoms, a part νp of which decrees a success, with p related to E[X] through
a proper scale factor. Of course, the more the subset size increases the more the
percentage of the successful atoms gets close to p, with equality holding when the
subset coincides with the entire set. Since ν is infinite and the size of the subset
is related to the length of our observation record this condition will never be
satisfied. We have however analytical tools for controlling the proximity of the




two quantities, whose explanation will offer further perspectives for reinforcing our
intuition on the matter. The last theorem lets us give a shape to the dispersion
of S around the mean with non vanishing p. The next chapter will provide
methods for quantitatively appreciating this dispersion even in case of a limited
number of terms in the sum and/or vanishing p, as basic links between observed
data and theoretical models about them.


1.4 Bibliographical notes and further readings 


The approach to probability we follow in this book is quite different from
the ones we most commonly see in the literature: in an extreme summary it
reads “first the observed data, then a tool for synthesizing their properties in
a way, the probability, useful for guessing characteristics of new observations”.
We start from the universally accepted assertion by Laplace made in the early 
19th century: “Probability theory is nothing but common sense reduced to cal- 
culation” [Laplace, 1812] and from his general idea that: “The theory of chance 
consists in reducing all the events of the same kind to a certain number of cases 
equally possible, that is to say, to such as we may be equally undecided about in 
regard to their existence, and in determining the number of cases favorable to 
the event whose probability is sought” [Laplace, 1814]. Then, avoiding getting 
stuck in metaphysical traps, we found it useful to interpret equal possibility in 
cognitive terms. Namely we have our computational context (see Def. 1.1) as 
the true prior of our framework, and state a symmetry relation between strings 
of data in terms of their indistinguishability w.r.t. the available computational 
tools we have in order to get a given target. The computational context is the 
melting bath where past and future data lie, where their symmetry translates 
into uncertainty about the future population that we may describe in terms of 
probabilities as a function of past observed data. This implies, as a first conse- 
quence, that the parameters of the distribution laws of our interest are random 
variables themselves. A pioneer of this idea was R. A. Fisher, who speaks [Fisher,
1935] of “the probability that μ (mean of a Gaussian variable; our note) is less
than any assigned value, or the probability that it lies between any assigned
values, or, in short, its probability distribution, in the light of the sample
observed”. He underlines that this is not a Bayesian a priori probability [Florens
et al., 1990], but just comes from the observed sample. Actually his formulation
was not so clear, and he disagreed with the interpretation supplied by Fraser
[Fraser, 1958] and other statisticians, so that he complains about “a doctrine
which has been accepted and developed by the most eminent men of their time, 
and is now perhaps accepted by men now living, which at the same time has 
appeared to a succession of sound writers to be fundamentally false and devoid 
of foundation”. Actually, Fisher makes a distinction between probability mea- 
sure on random variables and analogous measure on parameters that he calls 
fiducial probability. Tukey [Tukey, 1947] offers a contribution to removing this 
distinction by fixing the distribution law of the probability measure of domains 
bounded by order statistics (see Sec. 4.4 later on), a law (the Beta distribution,
see Appendix B) that is independent of the probability distribution with
which the sample has been drawn. This independence says that, given a sam- 
ple, we may make no assumption about the distribution, except that the future 
data too are measured according to this distribution. We give a further read- 
ing of this framework in light of the predictive inference approach [De Finetti, 
1975, Geisser, 1993]: given actual observations predict properties on the subse- 
quent ones. Thus it is not unusual to consider a probability of any population 
property as a random variable per se. 

Our direct companion reference framework is the one proposed by
Kolmogorov [Kolmogorov, 1933]. We use the same descriptive tools supplied by
the classical Set Theory [Lévy, 1979], merely mentioning fuzzy sets [Cox,
1998], which we read as a relaxed version of these tools. For a deeper
comparison between Kolmogorov and the Fuzzy Sets framework the reader can refer
to [Thomas, 1995]. From an operational perspective our approach is rooted in
the combinatorial calculus that, not coincidentally, represents one of the earliest
challenges in Theoretical Computer Science [Percus, 1971, Aho et al., 1974].
But we escape the difficulties of this calculus as soon as possible taking shelter 
in asymptotic results. Here we find a point of merger with the Kolmogorov ap- 
proach. From a true modeling perspective, i.e. for considering the asymptotic 
properties of a fixed population of a random variable, we massively use results of 
the axiomatic Kolmogorov probability theory. Plenty of books provide a more 
extensive explanation of this. Out of personal preference or mere familiarity, 
we refer to [Mood et al., 1974] for a more didactical treatment and to [Wilks, 
1962, Rohatgi, 1976] for a more mathematical discussion, with slightly different
exposition styles and contents lists between them. Even with these results,
however, we aimed to stress the operational meaning of both probabilistic tools 
and models. In this we were mainly inspired by Kolmogorov’s last ideas. In 
the final period of his rich and vast scientific activity he related the randomness 
of a single data item to the algorithmic structure underlying the set of bits 
constituting its representation [Kolmogorov, 1965]. Thus, for instance a string 
of all 0’s is less random in general than a string of perhaps 110110010001 that 
you can describe no more concisely than by just repeating the sequence of its 
bits. The interested reader can find an exhaustive exposition of this approach 
in [Li and Vitanyi, 1997]. In Chapter 5 we will use a few results coming ex- 
actly from this idea of randomness. In this chapter rather we were compelled 
to show the operational meaning of some classical theory results in terms of 
inner mechanisms affecting either special features of random variables, such as 
their independence, or overall probabilistic models. The general starting point 
is constituted by a set of uniformly distributed atomic events plus a specific 
structure between them. 

Since structure between data is computationally realized by algorithms we
recall some results of theoretical computer science. The main goal is to
understand the complexity of some algorithms, or their mere existence, as a measure of the
randomness of the produced strings, in the mentioned Kolmogorov algorithmic
perspective. In this direction the reader can delve deeper into questions
connected with the computability of the functions in books like [Roger, 1967] or




the difficulty of their computation in books like [Garey and Johnson, 1978, Pa- 
padimitriou, 1994]. The particular class of one way functions at the basis of 
generation of random strings through a regular computer is a favorite topic of 
cryptography books like [Goldreich, 1999]. 

Finally the generation of random variables is a facility very often employed
for solving computational problems, possibly not directly involved with random
phenomena. In the literature it is generally denoted as random variable
simulation so as not to offend the axiomatic probability approach. In our cognitive
perspective, instead, we cannot conceive of a random number generator that is not
an algorithm of our computational context. It is exactly this routine that lies at the root
of the inference methods we will develop in the next chapter. Random
generators are commonly available in many software libraries, such as [Little, 2003]
and [Fishwick, 2003], while many books are devoted to studying algorithms for 
efficiently simulating complex joint distribution laws of many variables (see for 
instance [Ross, 1997, Ripley, 1987]). 


2 — Algorithmic inference 


As stated right from the start, the goal of this book is to devise some methods
for organizing the knowledge we achieve from a set of data in a way that will tell
us something useful about future data we will meet in the same framework. In
some cases we are lucky enough to discover an algorithm (for instance F = ma)
outputting exactly these unknown data as a function of what we have observed.
The same might not happen for the messages filling our mailbox, as pointed out
in the previous chapter. Maybe this is due merely to our insufficient knowledge
about the existing software library or to our scarce ability in producing our own
software. In Sec. 1.3.6, however, we showed that, with our computer too ¹, we
may produce sequences of data so odd that even a skilled programmer endowed
with realistic computational resources is unable to forecast them. In such cases
we abandon the ambitious idea of predicting exactly the sequence of future data
and aim instead at fixing some general properties about them, such as their sum
in a given normalization or more complex parameters.


Fig. 2.1: Sample and population of random bits.


2.1 The predictive approach: a string of bits partitioned by a cursor


In a very simple figuration, the target of our predicting methods is a string of 
data, binary in standard representation, divided by a cursor (see Fig. 2.1). The 


1 In the commonly accepted conjecture that one way functions exist.




part before it represents what we have already observed and therefore call
sample. The part after the cursor is the data we will meet in the future, actually
finite in number like our life, possibly approximated by an infinite sequence
coinciding with the population associated to a random variable. It is exactly the
scenario within which we defined our probability notion. In an extremely
principled formulation we should operate here with feature and experiment spaces,
but we prefer to fix in advance the outcome space in order to comfortably exploit
the probabilistic tools discussed in the previous chapter.


2.1.1 The inference approach 


It is crucial to note that, though infinite, this population still remains
unwritten, being a possible continuation of our observation story. In other words,
our operational framework is constituted not by a sample of data drawn from
a preexisting population ² but by a set of data we actually know. Then, by
definition the future will come later yet only from among the sequences of data
that are compatible with the former. As in the case of (1.9) in Sec. 1.2.1 we will
in principle be able to compute a probability distribution P over these sequences
as a measurable counterpart of the compatibility concept. This means that the
properties we want to discover (the object of our inference, put in standard
terms) are random quantities as well, which we will describe through random
variables whose distribution laws derive exactly from P. The general scheme
is depicted in Fig. 2.2. We have on the one hand the world of guesses about
properties of the population, which we call Π, and on the other the world of
actual observations where, as the phenomenon we refer to is the same, the
above hypotheses result in corresponding properties π of the sample. So we
can use the probability of the sample meeting π to get the probability that the
corresponding Π are satisfied. Relating the sample variables to uniform ones as
discussed in Sec. 1.3.6 allows us to read the whole matter in terms of the
identification of mappings like (1.66) or (1.71) matching both π and Π. Hence
we give to our predicting methods the name of algorithmic inference.

It is worth noting the time scales. While we get the sample in a finite 
observation time, the properties we refer to in the future are either intensive, 
such as the minimum or maximum of the population, or cumulative (such as the 
sum of the data per se), which must be normalized on a limited time interval. 
We can also have properties normalized in two dimensions such as time and 
space but observed solely along the time dimension without marginalizing along 
the other. In cases like this (for instance in quantum mechanics) we virtually 
manage square roots of probabilities without considering the second dimension. 

Actually, computable numerical values are the final result of the proposed
logical mechanism, usually called inference. Quantiles ³ of the goal properties
are, for instance, the true inputs for any decision we may write down in our


2 The Aristotelian framework: fate is already written in the stars, and we “poor” humans try to
discover it from some clues we had the chance to observe.
3 See Definition B.5 in Appendix B.




Fig. 2.2: Twisting properties between sample and population. [Diagram: the sample
(world of observations, property π, data $x_1, \ldots, x_m$) faces the population (world of
guesses, property Π, data $x_{m+1}, \ldots$); P(π is observed) is linked to P(Π is true).]


operational framework. A typical way of using them is represented by
confidence intervals, i.e. intervals where we expect the random properties to belong
with good probability. In this part of the book we will see various tools for
getting quantiles even when their computation is analytically unfeasible or
undetermined due to a lack of information on the random variables involved.

Moreover, as for a generic distribution law, we may find it convenient to refer
to less detailed information, generally relying on a single value as representative
of the property. This is a wide chapter of classical statistics, entitled point
estimation, that we will revisit from our particular perspective. The main point
is that we are looking for a sentence on the sample that constitutes an estimate
of a population property; and we want it to prove a suitable estimate in spite
of the variability of the populations compatible with the sample. This is a
perspective dual to the classical approach, which searches for an
estimate computed on a sample drawn from a given population and proving
suitable in spite of the variability of the samples that may be drawn from such
population. The second approach has, however, the great benefit of being less
computationally demanding in general and we will use it when we want to simplify
some inference procedures.

A further aspect of the inference that emerged with the access to computer
facilities concerns the kind of properties we are interested in. In the past we
dared to consider only some very simple properties, e.g. mean and variance of
a random variable or of a simple function of it, whose counterpart computation
on the sample did not prove heavy (what we could call abacus statistics). Today
computers let us pose more complex queries. A property may be the output of a
long computation on observations. In cases like this it is customary to speak of
learning rather than of inferring activity. This will be the topic of next chapters.




2.1.2 A universal sample generator 


Coming back to the general picture in Fig. 2.1 we may assume that each of the
ν bits in the written string represents a specification of a random variable X
and that no further relation exists between the bits. In other words, we make
the strong hypothesis at the moment that we have no program in our software
library computing anything of interest to us from this string other than the
programs having in input blocks of ν bits. According to the random variable
generation scheme in Sec. 1.3.6, this corresponds to assuming the various blocks
as representatives of specifications $x_i$ of random variables $X_i$, each distributed
as X and independent of the others. We summarize this assumption by saying that
the part of the string before the cursor is a specification $\boldsymbol{x} = \{x_1, x_2, \ldots, x_m\}$ of
a random sample $\boldsymbol{X} = \{X_1, X_2, \ldots, X_m\}$. The rest is the population (or a subset
of it), i.e. a longer (possibly infinite) sequence of $X_i$'s whose specifications will
affect our operational framework but cannot be known in advance.
Moreover, still in light of Sec. 1.3.6, we may assume $\{x_1, x_2, \ldots, x_m\}$ as the
outputs of some function $g_\theta$ having input $\{u_1, u_2, \ldots, u_m\}$ from a set of
independent random variables uniformly distributed in the unitary interval (effectively,
the most essential source of randomness) constituting the random seeds
(or globally the seed) of $\{X_1, X_2, \ldots, X_m\}$ (see Algorithm 1.1). Namely, $g_\theta$ is a
generalized inverse $F_X^{-1}$ of the c.d.f. $F_X$ of X. You get $F_X^{-1}(a)$ as the abscissa
of the (actual or dummy) point q of the $F_X$ graph with ordinate $q_y = a$. In
the case of continuous random variables, q is an actual point of coordinates
$(q_x \mid F_X(q_x) = a,\ a)$ (see Fig. 2.3(a)). In the case of discrete variables, you
consider the stepwise graph as in Fig. 2.3(b), where vertical bars are added to render
it connected; $F_X^{-1}(a)$ is either the abscissa of the dummy point q with ordinate
a belonging to a vertical bar, or the left extreme a' of a step of the $F_X$ graph if a
equals exactly the height of this step w.r.t. the x axis ⁴. In conclusion, the
generalized inverse $F_X^{-1}$ is the ordinary inverse of the cumulative distribution
function if X is a continuous random variable, or the function encoded in (1.71)
for X discrete ⁵.
In turn, $\boldsymbol{x}$ is the input for any kind of actual computation which we hence
denote a statistic.
This is our sampling framework, which we fix through the following
definition:

Definition 2.1. Given a string of data $\boldsymbol{x} = \{x_1, x_2, \ldots, x_m\}$, we call it an i.i.d.
sample specification if it is constituted by a specification of a set of independent
identically distributed random variables $\boldsymbol{X} = \{X_1, X_2, \ldots, X_m\}$. We use
notations $\boldsymbol{x}_m$ and $\boldsymbol{X}_m$ when we want to identify with the subscript the cardinality of
the represented set. We call sample the set of i.i.d. variables, and statistic any
random variable that is a known and computable function of the sample or the
function itself by abuse of notation. A specification of the statistic is the output
of the function computed on a specification of the sample. If we plan to use a
particular statistic in place of an unknown parameter, we call it an estimator.

4 Actually the second option has a non null probability because of discretization of U in
real computations.
5 Or a combination of them for X following a mixed distribution law.




Fig. 2.3: The sampling mechanism for: (a) continuous, and (b) discrete random vari- 
ables. 


In line with the above abuse and in view of gaining explanatory evidence we will 
sometimes (almost always when we deal with estimators) use the term statistic 
to denote quantities that are specifications of a statistic. In this case too the 
convention on the use of upper and lower case letters will distinguish a random 
variable representing a statistic from its specifications. A sampling mechanism $M_X = (U, g_\theta)$ is a pair constituted by a uniform random variable U ranging in the interval [0,1] (unitary henceforth) plus any mapping $g_\theta : [0,1] \to \mathbb{R}$ such that $F_{g_\theta(U)} = F_X$. Finally, $g_\theta$ is called the explaining function of the sampling mechanism. Within this notation the outcome space $\mathfrak{D}_X$ is generally denoted as the sample space from which the X samples are drawn for a given $\theta$, while the union of the $\mathfrak{D}_X$ corresponding to all $\Theta$ specifications is called the support $\mathfrak{X}$ of X. These sets are also called the sample set $\mathfrak{D}_{\boldsymbol{X}_m} = \mathfrak{D}_X^m$ and the sample support $\mathfrak{X}^m$ when we refer to the random variable $\boldsymbol{X}_m$.


Remark 2.1. For any sampling mechanism $M_X = (U, g_\theta)$ generating X, variously splitting $g_\theta$ into a composition $g'_{\theta_1}(g''_{\theta_2})$ we may figure many variants of sampling mechanism $M'_X = (Z, g'_{\theta_1})$ where Z is generated by $(U, g''_{\theta_2})$. Their suitability depends on what we take for granted, summarized in $\theta_2$, and what we must elaborate, gathered in $\theta_1$. Note that the equiprobability of the seeds is piped to both the Z and the X specifications. Namely each $u_i$ has the same probability density (probability, given the discretization operated by the device computing it) of being generated. This means that the expected number of $u_i$'s below a given threshold $t \in [0,1]$ within $\{u_1, \ldots, u_m\}$ is $\approx t \cdot m$ and the frequency of such $u_i$'s tends to $t$ with $m \to +\infty$. Analogously, the numbers of corresponding $z_i$'s and $x_i$'s below thresholds $z$ and $x$, respectively, are expected to be $\approx F_Z(z) \cdot m$ and $\approx F_X(x) \cdot m$, and the frequencies tend to $F_Z(z)$ and $F_X(x)$ with $m \to +\infty$. What changes is the spread of the elements on the real axis: uniform in [0, 1]


You compute statistics for inferring about random parameters

Nothing else convincing explanations, hence compatible with what you observed.


62 Algorithmic inference 


for the former, variously concentrated for the latter, exactly in order to satisfy the above conditions. This is why: i) on one side each sample element has the same weight 1/m both in the sample expectation operator and in the empirical c.d.f., and ii) we assume a sample $\{y_1, \ldots, y_m\}$ correct if $\{F_Y(y_1), \ldots, F_Y(y_m)\}$ behaves like a sample from a uniform distribution.
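The two facts in Remark 2.1 can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a negative exponential X with the explaining function $g_\lambda(u) = -\ln(u)/\lambda$, and the sample size, parameter value and thresholds are hypothetical choices made only for illustration.

```python
import math
import random


def g_exponential(u, lam):
    """Explaining function assumed here for a negative exponential law with rate lam."""
    return -math.log(u) / lam


def exponential_cdf(x, lam):
    """C.d.f. F_X of the negative exponential law with rate lam."""
    return 1.0 - math.exp(-lam * x)


def frequency_below(values, threshold):
    """Fraction of values not exceeding the threshold."""
    return sum(v <= threshold for v in values) / len(values)


rng = random.Random(0)                 # fixed seed, only for reproducibility
m, lam = 20000, 2.0                    # hypothetical sample size and parameter
seeds = [1.0 - rng.random() for _ in range(m)]   # uniform on (0, 1], avoids log(0)
xs = [g_exponential(u, lam) for u in seeds]      # sample produced by the mechanism

freq_u = frequency_below(seeds, 0.3)             # expected to approach t = 0.3
freq_x = frequency_below(xs, 0.5)                # expected to approach F_X(0.5)
pit = [exponential_cdf(x, lam) for x in xs]      # F_X applied to each sample element
freq_pit = frequency_below(pit, 0.3)             # expected to approach 0.3 again
```

The last two lines illustrate point ii): the values $F_X(x_i)$ behave like a sample from the unitary uniform distribution.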


Remark 2.2. You recognize that $h(X_1, X_2, \ldots, X_m)$ is a statistic if h can be run by your computer obtaining an output on any set $\{x_1, x_2, \ldots, x_m\}$ without requesting values for free parameters. There exists a unique function $g_\theta$ for explaining samples of a continuous random variable through (1.66), while an infinite set of isomorphic functions is available for discrete variables. It is precisely the parameter $\theta$, or rather the related random variable $\Theta$, that is the property $\Pi$ constituting the main object of the inference methods in this book. The function $g_\theta$, with $\theta$ free parameter, sums up everything we already know about X, actually nothing when $\theta$ coincides with the set of the delimiters in Fig. 1.13(b). The companion properties $\pi$ are statistics computed on the sample. Henceforth we will often rely simply on the difference between capital and small letters denoting symbols to distinguish samples and statistics from their specifications.


2.1.3 In search of compatible populations 


We want to capture relevant features of populations which are compatible with 
the sample we have observed in the sense they may generate a sample exhibiting 
the same value of the statistics which we appreciate in the actually observed 
sample. The paradigmatic way of reasoning may be this: at the third apple 
you taste from a basket and find bad you cannot decree that the whole basket is 
bad; in any case you prefer to pick the next apple from another heap. In other 
words, from among various explanations you may find for your experiment, you 
appraise less consistent with it the fact that the rest of the basket is made up of 
wonderful fruits. 

We generate a sample $\{x_1, \ldots, x_m\}$ of a Bernoulli variable X with parameter p, at the root of the above reasoning, through a sampling mechanism $M_X = (U, g_p)$ with explaining function:

$$g_p(u) = \begin{cases} 1 & \text{if } u \le p \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

Forget about the problem of how to draw $\{u_1, \ldots, u_m\}$ (tossing a sequence of dice, using a random number generator; see Secs. 1.3.6 and B.2.1). As a matter of fact you do not know the u's. You only observe $\{x_1, \ldots, x_m\}$. Rather, as the order with which to consider the random bits is meaningless under the hypothesis of their independence, what you really observe is the number k of 1's within the sample. Looking for compatible populations with the observed




Fig. 2.4: Generating a Bernoullian sample. Horizontal axis: index of the U realizations; 
vertical axis: both U (lines) and X (bullets) values. The threshold line p realizes a 
mapping from U to X through (2.1). Below strip: produced bits. 


k, we may rely on past and future observation histories representation as in 
Fig. 2.1, i.e. through a string of bits X that we partition into sample (the 
string prefix) and population (the suffix). All these data share the feature of 
being independent observations of the same phenomenon, hence derived by the 
explaining function (2.1) with same parameter p. 


As can be seen from Fig. 2.4, for a given sequence of U specifications we obtain different binary strings with different values of k constituting the statistic $s_P = \sum_{i=1}^{m} x_i$ that we observe in the sample depending on the height p of the threshold line.
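The dependence of the statistic on the threshold height can be sketched in a few lines. This is an illustrative sketch only: it assumes the explaining function (2.1) with the convention that a seed not exceeding p maps to 1, and the seed vector and the grid of thresholds are hypothetical.

```python
import random


def g_p(u, p):
    """Explaining function (2.1): the bit is 1 when the seed does not exceed p."""
    return 1 if u <= p else 0


def bernoulli_sample(seeds, p):
    """Map a fixed vector of seeds into bits through the threshold p."""
    return [g_p(u, p) for u in seeds]


rng = random.Random(1)                       # fixed seed for reproducibility
seeds = [rng.random() for _ in range(30)]    # one specification of the 30 seeds

thresholds = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
# Statistic s_P (number of ones) obtained from the SAME seeds for each threshold.
counts = [sum(bernoulli_sample(seeds, p)) for p in thresholds]
```

For a fixed seed vector the list `counts` never decreases as the threshold rises, which is the monotone relation exploited in the rest of the section.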


Note that in turn p represents also the counterpart of k, in terms of frequencies referred to an infinite population, i.e. the limit of the frequency of 1's in the sample suffix when its length M goes to infinity. Hence shifting to the m + M uniform random variables, we may wonder with which p a sampling history may continue when the first m observations counted exactly k bits set to 1. Per se any suffix of M uniform specifications may follow with equal probability the former m specifications generating the statistic k. On each specification of the suffix we have a specification p of a random variable P as well. Thanks to (2.1) however, the probability with which P takes its values on the suffixes is not uniform. To get visual evidence of it, in Fig. 2.5(a) we consider a string of 30 + 200 unitary uniform variables representing, respectively, the random seeds of a sample and a population of Bernoulli variables. Then according to the explaining function (2.1) we compute a set of Bernoullian vectors 230 bits long with p rising from 0 to 1. Namely, each stepwise line in Fig. 2.5(a) represents a trajectory described by the point of coordinates $\phi = k/30$ and $\psi = r/200$, computing the frequency of 1's in the sample and in the population respectively, for a given specification of the random seeds. Let us repeat this experiment 32 times (a small number for the mere sake of easy visualization), each time tossing a different vector of uniform variables. On the vertical


Raise p, raise s

but which p actually underlies $s_P$?

Simply unfolding the sampling mechanism

you have p to be a specification of random variable P

that you capture by crossing $s_P$ with all possible observation vicissitudes.

Uniform seeds move into the non uniform distribution of P

An always valid mechanical reasoning on



Fig. 2.5: Twisting sample with population frequencies along sampling mechanism trajectories. Population and sample sizes: M = 200 and m = 30 elements, respectively. $\phi = k/m$ = frequency of ones in the sample; $\psi = r/M$ = frequency of ones in the population.
Stepwise lines: (a) trajectories described by the number of 1's in sample and population when p ranges from 0 to 1, for different random seeds, and (b) tracks subset having $0.2 - 0.01 < \psi < 0.2 + 0.01$ when k = 9. Thick lines mark the coordinates of a possible twisting argument pivot (see Sec. 2.1.5).


section in correspondence to abscissa k (k/30 after normalization) we have the distribution of the p's compatible with $s_P = k$. For instance we may simply approximate the probability that 0.2 − 0.01 < P < 0.2 + 0.01 when k = 9 through the frequency of $\{u_1, \ldots, u_{m+M}\}$ tracks like in Fig. 2.5(b) that cross within this gate the section k = 9. By fitting a larger number N of curves in the same section we draw the histogram in Fig. 2.6 that denotes a smooth yet asymmetric distribution law of P whose analytical shape we will discuss later. The curve is centered around the value 0.315 ⁶ with a limited variance. There is no limit to the accuracy with which we may determine the parameter distribution law. It is just a matter of drawing very many $(\phi, \psi)$ trajectories (actually we used 5000 such curves) and counting the frequencies $\psi$ on very long suffixes (or definitely referring to p); for common applications the random number generator available in the operating system of a personal computer (e.g. the rand() routine in a standard Linux distribution) is suitable ⁷.
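The experiment behind Figs. 2.5 and 2.6 can be sketched as follows. This is a minimal simulation under stated assumptions, not the authors' code: the explaining function (2.1) is taken with the convention $u \le p$, and the function names, the random seed and the gate test on the lower corner of each step are illustrative choices.

```python
import random


def crossing_frequencies(k, m, M, rng):
    """For one vector of m + M seeds, return the population frequencies psi at
    the lower and upper corner of the step where the sample counts k ones
    (requires 1 <= k < m)."""
    sample_seeds = sorted(rng.random() for _ in range(m))
    population_seeds = [rng.random() for _ in range(M)]
    p_low = sample_seeds[k - 1]    # smallest threshold giving exactly k ones
    p_high = sample_seeds[k]       # from this threshold on the sample has k + 1 ones
    psi_low = sum(u <= p_low for u in population_seeds) / M
    psi_high = sum(u < p_high for u in population_seeds) / M
    return psi_low, psi_high


def simulate(k=9, m=30, M=200, n_tracks=5000, seed=0):
    """Repeat the experiment on n_tracks independent seed vectors."""
    rng = random.Random(seed)
    return [crossing_frequencies(k, m, M, rng) for _ in range(n_tracks)]


tracks = simulate()
mean_low = sum(lo for lo, _ in tracks) / len(tracks)
mean_high = sum(hi for _, hi in tracks) / len(tracks)
# Frequency of tracks whose lower corner falls in the gate 0.2 +/- 0.01 at k = 9.
gate = sum(0.19 < lo < 0.21 for lo, _ in tracks) / len(tracks)
```

Each track contributes a whole interval of compatible frequencies, so the sketch keeps both corners of the step; the two averages bracket the centre of the histogram.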

We may extend the method to many other inference instances, e.g. for the parameter $\Lambda$ of a negative exponential distribution on the basis of the statistic $s_\Lambda = \sum_{i=1}^{m} x_i$ (see Fig. 2.7(a),(b)), or the extreme A of a variable uniformly distributed in [0, A] on the basis of the statistic $s_A = \max\{x_1, \ldots, x_m\}$ (see Fig. 2.7(c),(d)). We may easily realize, as a weakest condition, that all we require of


6 A number included within the interval $\left(\frac{9}{30+1}, \frac{9+1}{30+1}\right)$, as will be shown later.
7 As well known, the quality of these generators depends on the independence of the generated numbers, i.e. on the one way facility of the algorithm computing them [Levin, 1987].




Fig. 2.6: Histogram $\phi_P$ describing the distribution law of P when $s_P = 9$.


the statistic s, as a function h of a sample specification $\{x_1, \ldots, x_m\}$, to derive the parameter $\Theta$ distribution law, is the satisfaction of the three statements in the following definition.


Definition 2.2. A statistic $S = h(X_1, \ldots, X_m)$ with specifications in $\mathfrak{D}_S$ is well behaving w.r.t. a parameter $\Theta$ if:

1. a uniformly monotone relation exists between s and $\theta$ for any fixed seed, like the ones in Figs. 2.5, 2.7(a,c), so as to have a unique cross of the seed trajectories with the s vertical line,

2. on each observed s the statistic is well defined for every value of $\theta$, i.e. any sample specification $\{x_1, \ldots, x_m\} \in \mathfrak{X}^m$ such that $h(x_1, \ldots, x_m) = s$ has probability (density) different from 0, so as to avoid considering non surjective mapping from $\mathfrak{X}^m$ to $\Theta$, i.e. associating via s to a sample $\{x_1, \ldots, x_m\}$ a $\theta$ that could not generate the sample itself, and

3. the crossing points $\{(\theta_1, s), \ldots, (\theta_N, s)\}$ constitute a true $\Theta$ sample for the observed s. As $\theta_j = \rho(s, u_1^j, \ldots, u_m^j)$ is a solution of the equation $s = h(g_\theta(u_1), \ldots, g_\theta(u_m))$ with the seed $\{u_1^j, \ldots, u_m^j\}$, this requires the conditional joint distribution of $\{U_1, \ldots, U_m \mid S = s\}$ to be independent of any (hidden) parameter, so as to avoid any bias in the $\Theta$ histogram. In turn, it results in the same requirement on $\{X_1, \ldots, X_m \mid s\}$, since $x_i = g_\theta(u_i)$.

a uniform cause-effect relation
without holes in the sample space.
resulting in a true sampling of the parameter


Example 2.1.

• In the case of [0, A] uniform variable with obvious explaining function

$$g_a(u) = a\,u \qquad (2.2)$$

the statistic $S_A = \max\{X_1, \ldots, X_m\}$ is well behaving. On the contrary the statistic $S'_A = \sum_{i=1}^{m} X_i$ misses the second requirement. For instance, for A = a the observed sample $\{a/2, a/2, a/2\}$ gives $s'_A = \frac{3}{2}a$, but sample specification $\{\frac{3}{2}a, 0, 0\}$ giving rise to the same statistic has a probability density equal to 0. Thus we cannot say that the trajectory crossings of the line $s'_A = a$ are representative of the parameters compatible with any sample of three elements having sum equal to a.


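• In the case of the negative exponential distribution law the explaining function (see Example 1.10):

$$g_\lambda(u) = -\frac{\ln u}{\lambda} \qquad (2.3)$$

denotes that the above requirements are satisfied by the statistic $S_\Lambda = \sum_{i=1}^{m} X_i$ defined earlier in order to infer the parameter $\Lambda$ (having specification $\lambda$). Vice versa the statistic $S_A = \max\{X_1, \ldots, X_m\}$ misses the third requirement. Indeed for a given $s_A$ you have

$$f_{X_1, \ldots, X_m \mid S_A = s_A}(x_1, \ldots, x_m) = \frac{\lambda^m e^{-\lambda \sum_{i=1}^{m} x_i}}{m \left(1 - e^{-\lambda s_A}\right)^{m-1} \lambda e^{-\lambda s_A}} \qquad (2.4)$$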
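The failure of the sum statistic in the uniform case of Example 2.1 can be seen numerically. The sketch below is illustrative only: the observed sample, the number of seed replicas and the function names are hypothetical, and the explaining function is (2.2).

```python
import random


def invert_max(s_max, seeds):
    """Solve s = a * max(u_i) for a, i.e. invert the statistic max through (2.2)."""
    return s_max / max(seeds)


def invert_sum(s_sum, seeds):
    """Solve s' = a * sum(u_i) for a, i.e. invert the statistic sum through (2.2)."""
    return s_sum / sum(seeds)


observed = [0.2, 0.9, 0.4]        # hypothetical sample from a [0, A] uniform variable
rng = random.Random(2)            # fixed seed for reproducibility
n_replicas = 10000

a_from_max, a_from_sum = [], []
for _ in range(n_replicas):
    seeds = [1.0 - rng.random() for _ in observed]   # uniform on (0, 1]
    a_from_max.append(invert_max(max(observed), seeds))
    a_from_sum.append(invert_sum(sum(observed), seeds))

# A value of a below the largest observation could not have generated the sample.
infeasible_fraction = sum(a < max(observed) for a in a_from_sum) / n_replicas
```

With the maximum as statistic every generated value of a is at least as large as the largest observation, whereas the sum produces a sizeable share of values that are incompatible with the sample.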


The problem now is to reduce the computing time of the above procedure. 
As described in the next sections, we may use two fundamental steps to do this. 


2.1.4 Bootstrapping populations 


Welcome on board if you have a seed to compute ...

Starting from the sample $\{x_1, \ldots, x_m\}$ we may standardize the procedure shown in Fig. 2.5(b) as follows. We compute pivoting statistics (such as $s_P$ for the Bernoullian variable or $s_\Lambda$ for the exponential one, $s_A$ for the uniform etc.), then we sample $\Theta$ realizations compatible with them in a number N as large as we want in order to compute an empirical distribution of $\Theta$ as close to the true one as we like (see Definition 1.21). We achieve this by extracting N samples of size m from the uniform variable U constituting a set of seeds. On each seed $\{u_1^*, \ldots, u_m^*\}$ we solve the master equation in $\theta$:

$$s = h(\theta, u_1^*, \ldots, u_m^*) \qquad (2.5)$$


where s is the observed pivotal statistic and h the function connecting seeds to the parameter $\theta$ through the sampling mechanism $(U, g_\theta)$ associating the single $x_i$ to $\theta$. In general terms, $\theta$ may be a vector of parameters, s and h vectors as well.

For instance, for the negative exponential variable with parameter $\lambda$ (2.5) reads:

$$s_\Lambda = -\frac{1}{\lambda} \sum_{i=1}^{m} \ln u_i \qquad (2.6)$$

while for the Bernoulli variable with parameter p the equation is:

$$s_P = \sum_{i=1}^{m} I_{[0,p]}(u_i) \qquad (2.7)$$




Algorithm 2.1 Generating parameter populations through bootstrap.

1. Identify a well behaving statistic S for the parameter $\theta$ and its master equation;

2. compute a specification $s_\Theta$ from a sample of size m;

3. repeat for a satisfactory number N of iterations:

   (a) draw a sample $\tilde{u}_i$ of m uniform random variables;

   (b) get $\tilde{\theta}_i = \mathrm{Inv}(s_\Theta, \tilde{u}_i)$ as a solution of (2.5) in $\theta$ with $s = s_\Theta$ and $\{u_1^*, \ldots, u_m^*\} = \tilde{u}_i$;

   (c) add $\tilde{\theta}_i$ to $\Theta$ population.


Of course the last equation has no unique solution; rather, there is an interval of p values solving (2.7). This is part of the fact that when you count the number of trajectories crossing a vertical line (say for k/30 = 0.3) below a fixed $\psi$ in Fig. 2.5(a) you have no way of deciding whether you must consider the upper or the lower corner of the generic trajectory step. This gives rise to a definite indeterminacy on the c.d.f. value which we will describe analytically in the next section.

The $\Theta$ empirical c.d.f. reads

$$\hat{F}_\Theta(\theta) = \frac{1}{N} \sum_{j=1}^{N} I_{(-\infty, \theta]}(\tilde{\theta}_j) \qquad (2.8)$$

Note that the empirical c.d.f. accuracy increases in our approach with the 
number N of U samples (while some indeterminacies remain with discrete X), 
where the extraction of U specifications is just a matter of the computer time we 
are allowed to spend. We will call this procedure p-bootstrap when we want to 
highlight the fact that it generates population parameters in place of replicating 
samples like in the original version of this method. 8 


Example 2.2. A general procedure for generating parameter populations may be 
the one described in Algorithm 2.1. You may easily realize that we obtain the 
curve in Fig. 2.7(a) computing the empirical distribution (2.8) on the population 


8 Efron and Tibshirani's basic idea [Efron and Tibshirani, 1993] of stressing sample information is to generate sample replicas $\{x_1^*, \ldots, x_m^*\}$ by extracting elements of the original sample with replacement and to consider the distribution of relevant statistics over the replicas. Hence we extract N bootstrap samples of size m, i.e. the above replicas with this algorithm (or variants such as in Bayesian bootstrap [Lo, 1988]) and compute a set of specifications $\breve{\theta}_j$ each estimating $\theta$ through the statistic applied to the j-th replica. Finally, we get an approximate c.d.f. of $\Theta$ through:

$$\hat{F}_\Theta(\theta) = \frac{1}{N} \sum_{j=1}^{N} I_{(-\infty, \theta]}(\breve{\theta}_j) \qquad (2.9)$$

where to improve the empirical c.d.f. accuracy we need to increase the sample size m.


If the solution is 
not always 


univocal, this is 
not a drawback 


A huge set of 
specifications, 
hence a 
population, hence 
a c.d.f. 


A general 
purpose 

bootstrap 
procedure 


In search of well 
behaving 


statistics 




Fig. 2.7: (a) Trajectories described by the statistic $s_\Lambda$ for a negative exponential distribution, and (b) corresponding histogram approximating the distribution of the related random variable $\Lambda$ when $s_\Lambda = 2$ as highlighted by the vertical line in (a); (c) and (d) report the analogous experiments for the statistic $s_A$ describing the parameter A of a [0, A] uniform distribution.


obtained through Algorithm 2.1 when $\mathrm{Inv}(s_\Lambda, \tilde{u}_i) = \sum_j \left(-\ln (\tilde{u}_i)_j\right) / s_\Lambda$, and the curve in Fig. 2.7(c) when $\mathrm{Inv}(s_A, \tilde{u}_i) = s_A / \max_j \{(\tilde{u}_i)_j\}$ ⁹.
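Algorithm 2.1 together with the two inversion rules of Example 2.2 and the empirical c.d.f. (2.8) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the sample size m = 10, the observed value 0.9 of the maximum and the number of iterations are hypothetical, while $s_\Lambda = 2$ echoes Fig. 2.7.

```python
import math
import random


def p_bootstrap(s_obs, m, invert, n_iter, rng):
    """Algorithm 2.1: build a population of parameter values compatible with s_obs."""
    population = []
    for _ in range(n_iter):
        seeds = [1.0 - rng.random() for _ in range(m)]   # uniform on (0, 1]
        population.append(invert(s_obs, seeds))           # solve the master equation
    return population


def inv_exponential(s, seeds):
    """Solution of (2.6) in lambda for the statistic sum of the sample."""
    return sum(-math.log(u) for u in seeds) / s


def inv_uniform(s, seeds):
    """Solution of s = a * max(u_i) in a for the statistic max of the sample."""
    return s / max(seeds)


def empirical_cdf(population, theta):
    """Empirical c.d.f. (2.8) evaluated at theta."""
    return sum(t <= theta for t in population) / len(population)


rng = random.Random(3)                                   # fixed seed for reproducibility
lambdas = p_bootstrap(2.0, 10, inv_exponential, 5000, rng)
extremes = p_bootstrap(0.9, 10, inv_uniform, 5000, rng)
cdf_at_5 = empirical_cdf(lambdas, 5.0)
```

Passing the inversion rule as an argument keeps the loop of Algorithm 2.1 independent of the distribution family; only `invert` changes from one inference instance to another.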


The ways of identifying sampling mechanisms and bootstrapping populations 
may go variously out of box, like in the two following examples. 


Example 2.3. Consider the symmetric exponential distribution characterized by the density

$$f_X(x; \lambda) = \frac{1}{2} e^{-|x - \lambda|} \qquad (2.10)$$


9 Note that using $\mathrm{Inv}(s'_A, \tilde{u}_i) = s'_A / \sum_j (\tilde{u}_i)_j$, as the statistic $S'_A = \sum_{i=1}^{m} X_i$ would suggest, may give rise to solutions $\tilde{a}_i$ less than the maximum of the observed $x_i$'s, which represent unfeasible solutions.




Fig. 2.8: (a) Plot of the density function for the symmetric exponential distribution with parameter $\lambda = 6$. (b) Empirical c.d.f. of the corresponding random variable $\Lambda$ identified by a sample of size m = 7.


defined for every $x \in \mathbb{R}$ and $\lambda \in \mathbb{R}^+$ and whose graph for $\lambda = 2$ is illustrated in Fig. 2.8(a).

sometimes the median, with some approximation

Simple calculations suggest that the median $\tilde{x}$ of the sample is a relevant statistic. Namely, the sum of the absolute values of the difference between sample elements and an arbitrary value is minimized when this value is precisely the median. Thus we expect this statistic to follow closely the $\lambda$ variations via the sampling mechanism

$$g_\lambda(u) = \begin{cases} \lambda + \ln(2u) & \text{if } u < 0.5 \\ \lambda - \ln(2(1-u)) & \text{if } u \ge 0.5 \end{cases} \qquad (2.11)$$

Thus, given the monotony of $g_\lambda$ with u, (2.5) reads

$$\tilde{x} = \begin{cases} \lambda + \ln(2\tilde{u}) & \text{if } \tilde{u} < 0.5 \\ \lambda - \ln(2(1-\tilde{u})) & \text{if } \tilde{u} \ge 0.5 \end{cases} \qquad (2.12)$$

where $\tilde{u}$ is the median within the sample $\{u_1, \ldots, u_m\}$ that we define to be the (m + 1)/2-th value of the sorted sample $\{u_{(1)}, \ldots, u_{(m)}\}$ if m is odd, and any value between $u_{(m/2)}$ and $u_{(m/2+1)}$ if m is even (a typical choice is in this case the arithmetic mean of these values). Equation (2.12) shows that $\tilde{x}$ satisfies the first two properties of a well behaving statistic; the third one is only approximately satisfied given the analogy of $|a|$ and $a^2$ functions and the coincidence of mean with median of a symmetric population. Hence, from a set of N seeds we sample N values of $\lambda^*$ that are used to approximate the $\Lambda$ c.d.f. via (2.8). In particular from the sample {5.71, 7.13, 5.98, 5.89, 5.92, 9.66, 7.15} we obtain the c.d.f. as in Fig. 2.8(b).
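The inversion of (2.12) on the sample quoted in Example 2.3 may be sketched as below. It is an illustrative sketch under the assumption that (2.12) has the two-branch form $\tilde{x} = \lambda + \ln(2\tilde{u})$ for $\tilde{u} < 0.5$ and $\tilde{x} = \lambda - \ln(2(1-\tilde{u}))$ otherwise; the function name, the random seed and the number of replicas are hypothetical.

```python
import math
import random
import statistics


def inv_median(x_med, u_med):
    """Solve the master equation (2.12) in lambda, given sample and seed medians."""
    if u_med < 0.5:
        return x_med - math.log(2.0 * u_med)
    return x_med + math.log(2.0 * (1.0 - u_med))


sample = [5.71, 7.13, 5.98, 5.89, 5.92, 9.66, 7.15]   # sample quoted in Example 2.3
x_med = statistics.median(sample)                      # observed statistic

rng = random.Random(4)                                 # fixed seed for reproducibility
lambdas = []
for _ in range(5000):
    seeds = [rng.random() for _ in sample]             # one seed per sample element
    lambdas.append(inv_median(x_med, statistics.median(seeds)))


def empirical_cdf(population, theta):
    """Empirical c.d.f. (2.8) of the generated parameter population."""
    return sum(t <= theta for t in population) / len(population)


cdf_at_6 = empirical_cdf(lambdas, 6.0)
```

For an even sample size `statistics.median` returns the arithmetic mean of the two central values, which corresponds to the typical choice mentioned in the text.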


sometimes it is
the output of a
numerical
procedure.


A well behaving 
procedure 


that is accurate 
and not 
expensive. 


Indeterminate 
solution — 
bounding 
distributions 




Algorithm 2.2 Generating a hypergeometric random variable.

1. Draw N uniform values $\{u_1, \ldots, u_N\}$;

2. build the sets $R = \{(1, u_i),\ i = 1, \ldots, K\}$, $W = \{(0, u_i),\ i = K+1, \ldots, N\}$, and $B = R \cup W$;

3. sort the pairs in B according to the nondecreasing order of their second component;

4. build the set $\{x_1, \ldots, x_m\}$ consisting of the first component of the first m ordered pairs of B.


Example 2.4. Consider the experiment of drawing a ball from an urn containing N balls, K ¹⁰ of which are red while the remaining ones are white. If each drawn ball is not put back in the urn, the number h of red ones over m drawn balls is described by a hypergeometric random variable, synthesizing the hypergeometric model dealt with in Sec. 1.2.3. Namely ¹¹:

$$f_H(h; m, N, K) = \frac{\binom{K}{h} \binom{N-K}{m-h}}{\binom{N}{m}} \qquad (2.13)$$

Let us fix a sampling mechanism for a set of m extractions. Denoting $x_i = 1$ if the i-th drawn ball is red and $x_i = 0$ otherwise, a joint specification $\{x_1, \ldots, x_m\}$ can be obtained from a uniform source $\{u_1, \ldots, u_N\}$ according to Algorithm 2.2. A specification h of the hypergeometric random variable H registers the sum of the values $x_i$ obtained above. Aiming to compute the parameter K distribution law, we note that raising this parameter, say from $K$ to $\tilde{K}$, will never decrease the number of underlying 1's in the observed sample.

Therefore we can solve the master equation (2.5) as follows: repeat the four steps of the above procedure on U's replicas with a variant in step 2 that consists in shifting the value of K from 0 to N until the sum of the $x_i$'s computed at the end of step 4 equals the observed h. In this way you collect a sample of K's necessary to approximate its distribution law through (2.8). From a technical viewpoint, note that shifting K by one unit requires deleting the first element of W, appending it to R and changing the label of the corresponding element in the sorted B. Thus it is a matter of a linear procedure feasible with any computer. Moreover, (2.5) is solved also by the last K such that the sum of the $x_i$'s equals h. Hence we obtain a pair of cumulative distributions, like in Fig. 2.9, gauging the distribution of K. Once again the boiling up statistic is $\sum_{i=1}^{m} x_i$, satisfying the first two requirements for a well behaving statistic and the third one only approximately, with a similar behavior of the same statistic drawn from a Bernoulli variable.

10 Note that here we cannot distinguish random variable and specifications associated to the parameter K for typographical reasons.
11 Remember that for uniformity's sake we often confuse the notation $P_X$ with $f_X$.
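Algorithm 2.2 and the search for the compatible values of K described in Example 2.4 can be sketched as follows. This is an illustrative sketch only: it scans all values of K by brute force instead of using the linear update mentioned in the text, the function names and the number of replicas are hypothetical, and the values N = 50, m = 10, h = 2 are those of Fig. 2.9.

```python
import random


def hypergeometric_sample(seeds, K, m):
    """Algorithm 2.2: label the first K seeds red (1) and the others white (0),
    sort the pairs by seed value and keep the labels of the first m pairs."""
    pairs = [(1 if i < K else 0, u) for i, u in enumerate(seeds)]
    pairs.sort(key=lambda pair: pair[1])
    return [label for label, _ in pairs[:m]]


def compatible_K_range(seeds, h, m):
    """First and last K for which these seeds yield exactly h red balls."""
    n_balls = len(seeds)
    matches = [K for K in range(n_balls + 1)
               if sum(hypergeometric_sample(seeds, K, m)) == h]
    return matches[0], matches[-1]


def bootstrap_K(h, m, n_balls, n_iter, rng):
    """Collect the (first, last) compatible K over independent seed replicas."""
    return [compatible_K_range([rng.random() for _ in range(n_balls)], h, m)
            for _ in range(n_iter)]


def empirical_cdf(values, k):
    return sum(v <= k for v in values) / len(values)


bounds = bootstrap_K(h=2, m=10, n_balls=50, n_iter=300, rng=random.Random(5))
# The first compatible K's give the upper c.d.f. bound, the last ones the lower.
upper_cdf = [empirical_cdf([lo for lo, _ in bounds], k) for k in range(51)]
lower_cdf = [empirical_cdf([hi for _, hi in bounds], k) for k in range(51)]
```

The two lists play the role of the pair of cumulative distributions mentioned in the example: the first is built on the smallest compatible K of each replica, the second on the largest one.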




Fig. 2.9: Bounds on the c.d.f. of the parameter K for a hypergeometric distribution 
for N = 50, starting from the statistic h = 2 observed in a sample of size m = 10. 


2.1.5 Deriving analytical expressions 


In the case where we are lucky, we may analytically link the distribution law of a parameter to laws of other variables whose distribution is in turn affected by the parameter. Let's go back to Fig. 2.4. In particular the u's and p in the picture explain sample and population in the string underlying the graph. However, for each sample $\{x_1, \ldots, x_m\}$, hence for each fixed (though unknown) p and $\{u_1, \ldots, u_m\}$ in (2.1), and for whatever number k of 1's observed in the sample and any other number $\tilde{p} \in [0,1]$ we can easily derive the following implication chain:

$$(k_{\tilde{p}} \ge k) \Leftarrow (p < \tilde{p}) \Leftarrow (k_{\tilde{p}} \ge k + 1) \qquad (2.14)$$

where $k_{\tilde{p}}$ counts the analogous number of 1's if the threshold in the explaining function (2.1) switches to $\tilde{p}$ for the same specifications $\{u_1, \ldots, u_m\}$ of U. The asymmetry derives from the fact that:


e raising the threshold parameter in gp cannot decrease the number of 1 in 
the observed sample, but 


e we can recognize that such a raising occurred only if we really see a number 
of 1 in the sample greater than k. 


Extending (2.14) to any string of m + M specifications of U, we may read the three conditions in this relation in terms of events of the $[0,1]^{m+M}$ sample space jointly spanned by the m + M uniform U's for fixed k and $\tilde{p}$. This amounts to considering analogous set relations between the three events $(K_{\tilde{p}} \ge k)$, $(P < \tilde{p})$ and $(K_{\tilde{p}} \ge k + 1)$, namely:

$$(K_{\tilde{p}} \ge k) \supseteq (P < \tilde{p}) \supseteq (K_{\tilde{p}} \ge k + 1) \qquad (2.15)$$

In words, the implications in (2.14) say that a raising of the threshold line which increments the number of 1's in the population cannot decrease the number of 1's in the sample and vice versa. Since these implications hold for any string of u's, we have (2.15).
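The implication chain (2.14) can be checked mechanically on random seeds. The sketch below is only an illustration: it assumes the explaining function (2.1) with the convention $u \le p$, and the sample size, the number of trials and the function names are hypothetical.

```python
import random


def count_ones(seeds, p):
    """Number of ones produced by the explaining function (2.1) with threshold p."""
    return sum(u <= p for u in seeds)


def chain_holds(seeds, p, p_tilde):
    """Check both implications of (2.14) for one seed vector and one pair (p, p_tilde)."""
    k = count_ones(seeds, p)
    k_tilde = count_ones(seeds, p_tilde)
    first = (not p < p_tilde) or (k_tilde >= k)          # (p < p~)  =>  (k~ >= k)
    second = (not k_tilde >= k + 1) or (p < p_tilde)     # (k~ >= k+1)  =>  (p < p~)
    return first and second


rng = random.Random(6)                                    # fixed seed for reproducibility
all_hold = all(
    chain_holds([rng.random() for _ in range(30)], rng.random(), rng.random())
    for _ in range(10000)
)
```

The check never fails because the count of ones is a nondecreasing function of the threshold for any fixed seed vector, which is exactly the argument used in the text.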


When simple 
logic arguments 
are enough 


For any fixed 
seed 


a logical relation; 


summing up over 
all seeds we get 


an events’ 
relation. 




It is a simple ...

For instance, let us focus on the cross formed by the vertical line $\phi = 0.3$ and the horizontal line $\psi = 0.4$ (the plain black lines in Fig. 2.5(a)) as the pivot $(k, \tilde{p})$ of our implications. It is clear that any trajectory crossing the line $\phi = 0.3$ at any ordinate $\psi = p < 0.4$, say $\psi = 0.2$ (the dashed line in the figure), will meet the line $\psi = \tilde{p} = 0.4$ at an abscissa $\phi = k_{\tilde{p}}/30 \ge k/30 = 0.3$. These random trajectories are exactly what commutes the randomness of the abscissa into the randomness of the ordinate. In other words, trajectories having a p less than $\tilde{p}$ are all and only those having a $k_{\tilde{p}} > k$, apart from some uncertainty about those having $k_{\tilde{p}}$ exactly equal to k.

allowing us to identify the c.d.f. ...

Hence, passing to the probabilities P, i.e. to the asymptotic frequencies on the above trajectories, we have the consequent bounds:

$$P(K_{\tilde{p}} \ge k) \ge P(P < \tilde{p}) = F_P(\tilde{p}) \ge P(K_{\tilde{p}} \ge k + 1) \qquad (2.16)$$

which characterize the c.d.f. $F_P$ of the parameter P when the statistic $s_P = \sum_{i=1}^{m} x_i$ equals k, as no assumption has been introduced against the continuity of P ¹².

A universal

The asymmetry disappears if we are working with a continuous random variable. For instance, for X negative exponential random variable of parameter $\lambda$, we have

$$x_i = g_\lambda(u_i) = -\frac{\ln u_i}{\lambda} \qquad (2.17)$$

(see Example 1.10). The analogous logical relation of (2.14) is the following:

$$(\lambda \le \tilde{\lambda}) \Leftrightarrow \left( \sum_{i=1}^{m} x_i \ge \sum_{i=1}^{m} \tilde{x}_i \right) \qquad (2.18)$$

where $\tilde{x}_i$ is the value into which $u_i$ would map substituting $\tilde{\lambda}$ for $\lambda$ in (2.17). It says that, for a fixed set $\{u_1, \ldots, u_m\}$, thanks to (2.17) a rise in the parameter $\lambda$ cannot increase the value of the single $x_i$, hence of the corresponding statistic, and vice versa. The consequent probability transfer is the following:

$$F_\Lambda(\tilde{\lambda}) = F_{\sum_i \tilde{X}_i}\left( \sum_{i=1}^{m} x_i \right) \qquad (2.19)$$


Getting definite values to c.d.f.'s

To get an analytical expression for the parameters' c.d.f.'s, we note that

• since $K_{\tilde{p}}$ follows a binomial distribution, (2.16) reads

$$\sum_{i=s_P}^{m} \binom{m}{i} \tilde{p}^{\,i} (1 - \tilde{p})^{m-i} \ge F_P(\tilde{p}) \ge \sum_{i=s_P+1}^{m} \binom{m}{i} \tilde{p}^{\,i} (1 - \tilde{p})^{m-i} \qquad (2.20)$$

allowing us to keep $F_P$ within the two curves in each frame of Fig. 2.10. This is the analytical counterpart of the multiple solutions admitted by

12Summing up: P integrates probabilities in vo P asymptotically accounts frequencies 
over M bits with M — +00. 




Fig. 2.10: Bounds on the c.d.f. of the parameter P for a Bernoulli distribution, compatible with an observed sample $\{x_1, \ldots, x_m\}$ such that $\sum_{i=1}^{m} x_i = s_P$. (a) m = 10, $s_P = 1$; (b) m = 10, $s_P = 5$; (c) m = 10, $s_P = 9$; (d) m = 100, $s_P = 10$; (e) m = 100, $s_P = 50$; (f) m = 100, $s_P = 90$.


(2.7). This is also the asymptotic form of (1.10) for N and K going to infinity.

• since the sum of exponential variables follows a Gamma distribution, the c.d.f. of $\Lambda$ is

$$F_\Lambda(\lambda) = 1 - \sum_{i=0}^{m-1} \frac{(\lambda s_\Lambda)^i}{i!} \exp(-\lambda s_\Lambda) \qquad (2.21)$$

with $s_\Lambda = \sum_{i=1}^{m} x_i$, as depicted in Fig. 2.11.
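The two analytical results can be evaluated directly. The following sketch assumes the forms of (2.20) and (2.21) as binomial tails and as the complement of a truncated Poisson sum, respectively; the function names and the numerical arguments in the usage lines are hypothetical.

```python
import math


def binomial_tail(m, k, p):
    """Probability that a binomial variable with parameters (m, p) is at least k."""
    return sum(math.comb(m, i) * p**i * (1.0 - p)**(m - i) for i in range(k, m + 1))


def bernoulli_cdf_bounds(p_tilde, s_p, m):
    """Lower and upper bound on F_P(p_tilde) according to (2.20)."""
    return binomial_tail(m, s_p + 1, p_tilde), binomial_tail(m, s_p, p_tilde)


def exponential_param_cdf(lam, s_lambda, m):
    """C.d.f. of the parameter Lambda according to (2.21)."""
    x = lam * s_lambda
    return 1.0 - math.exp(-x) * sum(x**i / math.factorial(i) for i in range(m))


# Bounds for m = 10 and s_P = 5 (the setting of one frame of Fig. 2.10), at p = 0.5.
lower, upper = bernoulli_cdf_bounds(0.5, 5, 10)
# C.d.f. of Lambda for m = 100 and s_Lambda = 2.5, evaluated at lambda = 40.
value = exponential_param_cdf(40.0, 2.5, 100)
```

Plotting `bernoulli_cdf_bounds` over a grid of `p_tilde` values gives the pair of curves enclosing $F_P$, while `exponential_param_cdf` gives a single curve since no indeterminacy arises in the continuous case.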


Note that curves like these make no sense in the case where we wonder about the frequency of 1's in a short continuation of an actual sample. Indeed this quantity remains random, but the sum of the probabilities with which it takes its values is not equal to 1, as specified in Sec. 1.2.1, i.e. these values do not define a random variable.
It is not a Bayesian approach

Convincing mathematicians that our approach to finding $F_\Theta$ is definitely different from Bayes' approach is typically difficult. Both start from a sample $\{x_1, \ldots, x_m\}$. In our view this is the sole information used to make the distribution of $\Theta$ compatible with the sample, via the sampling mechanism, where $\Theta$ is the unknown parameter of $F_X$. Bayesian statistics need also the knowledge of an a priori distribution of $\Theta$ that is modified given (in light of) the observed


p-bootstrap and twisting argument are the basic tools of algorithmic inference

What is most interesting in a sample, what less



Fig. 2.11: c.d.f. for the parameter $\Lambda$ for a negative exponential distribution, compatible with an observed sample $\{x_1, \ldots, x_m\}$ such that $\sum_{i=1}^{m} x_i = s_\Lambda$. (a) m = 100, $s_\Lambda = 1.3$; (b) m = 100, $s_\Lambda = 2.5$; (c) m = 100, $s_\Lambda = 10$.


sample. In some cases the two approaches lead to almost the same $F_\Theta$ after assuming a uniform a priori distribution for the parameter over its definition range. This happens, for instance, for $\Theta$ representing the parameter p of a Bernoulli distribution, where $F_\Theta$ according to Bayes coincides with our lower bound. It is easy to see that a similar a priori distribution cannot be attributed to the parameter $\lambda$ of a negative exponential variable, as this parameter ranges between 0 and $+\infty$.

We will refer to any expression similar to (2.14) and (2.18) as a twisting argument, since it allows us to exchange events on parameters with events on statistics. It constitutes the second main operational tool of the algorithmic inference framework introduced at the beginning of this chapter, whereas the first is p-bootstrap. For any sampling mechanism, $g_\theta$ is the pivot of the exchanges of probabilities between sample and population properties. We may be interested directly in $\theta$ or perhaps in other properties/parameters that represent some function $h(\theta)$. Henceforth we will not distinguish between $\theta$ and $h(\theta)$ as a target of our inference whenever no ambiguity arises.


2.1.6 Pivots and sufficient statistics 


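As pointed out in Sec. 2.1.4, aiming to discover the distribution law of the extreme a of a random variable X uniform in [0, a] we can state the following two implications:

$$(a \le \tilde{a}) \Leftrightarrow \left( \sum_{i=1}^{m} x_i \le \sum_{i=1}^{m} \tilde{x}_i \right) \qquad (2.22)$$

$$(a \le \tilde{a}) \Leftrightarrow \left( \max_{i=1,\ldots,m} \{x_i\} \le \max_{i=1,\ldots,m} \{\tilde{x}_i\} \right) \qquad (2.23)$$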

The two implications call for different distribution laws on the target parameter. 
This means that at least one of the two distributions is wrong, hence that at least 




Fig. 2.12: Contours of the sample set partition determined by statistic S. 


one of the two implications in each case is ill-posed. The well-behavedness of the statistic stated in Sec. 2.1.3 is a powerful tool for discovering this kind of error. A slightly more stringent criterion is based on the notion of sufficient statistic. In Sec. 1.1 we stated a protocol for drawing statistics from a sample consisting of three rules, which we denoted consistency, uniformity and symmetry. Wanting to state properties $\pi$ to twist with properties $\Pi$ on the entire population, we now fix a further rule for drawing statistics that encompasses symmetry and qualifies them as sufficient:


• SUFFICIENCY RULE. The value of any function of observed data you will consider to twist with parameters of a population must not change if the statistic is applied to any sample having the same virtual probability (density) of being observed.


The rationale is that you cannot univocally translate a property you observed on a set of data into a property concerning the whole future, nor can you expect that what you observed is the “best representative” in any sense of the future. Rather, you can decide by relying on properties you observed on the sample and would have observed on any sample appearing with the same probability. The value of this probability does matter, since it changes with the unknown parameters. This set of samples is further extended through the notion of virtual equiprobability, as will be explained later on. Thus on the one hand we consider sample properties that hit the structure of g_θ. On the other we use exactly this structure to reverse the value of the sample statistic into the value of the parameter determining it through g_θ, or rather of the part of it we capture with the statistic.

[Margin notes: No different sentences from samples we could meet with same probability. A family of explaining functions, a companion set of s contours, pushing s left to right with increasing θ, without goosenecks, i.e. a monotone flow of probability atoms along pipes determined by g_Θ and ruled by θ.]

Fig. 2.13: Trajectory of probability atoms in the partition induced by a statistic. (1 → 2 → 3): monotone trajectory (by a sufficient statistic); (1 → 2 → 4 → 5): gooseneck.

Let us consider the sample space of an experiment consisting in drawing a sample {X_1, ..., X_m} from X through the sampling mechanism (U, g_θ). Actually, we do not know the parameter of g, therefore we refer to the family {g_θ; θ ∈ Θ}, which we denote g_Θ, and to the companion family 𝒳 of related random variables. Then we draw the contours of the statistic S = h(X_1, ..., X_m) as in Fig. 2.12, i.e. the slices each constituted by the set of points x = {x_1, ..., x_m} such that h(x_1, ..., x_m) = s for any s ∈ D_S. Since the association of x's to points in the picture is arbitrary, let us arrange things so that slices move from left to right with increasing s. To give sense to the implication

\[ (\theta \le \tilde{\theta}) \Leftrightarrow (s_{\tilde{\theta}} \ge s) \tag{2.24} \]
for each s, θ and θ̃, we expect that for each x and its underlying u (i.e. x_i = g_θ(u_i) for each i) an increase of θ, say up to θ̃, pushes h(g_θ̃(u_1), ..., g_θ̃(u_m)) to a value s_θ̃ ≥ s. This depicts a monotonic motion of u, also from left to right with θ, in the picture accompanying the shifts of s it generates. Indeed we derive the u monotonicity from the sufficiency of S in the idea that no gooseneck is allowed in the u trajectories in Fig. 2.13 with θ running from its minimum to maximum values (possibly both infinite). You may picture the sufficiency rule in this way: the unitary probability mass is located on the left of the first slice at a conventional motion time t = 0 when the parameter takes its minimum value. Let us locate this mass uniformly in the unitary hypercube in the m-dimensional hyperspace, and partition the former into probability atoms corresponding to elementary hypercubes of edge δ small enough. Let θ grow monotonically with t and consider the process consisting in the migration of the probability atoms across the sample set slices along the pipes determined by g_θ with varying θ. The sufficiency rule requires that at all times the atoms in the intercept of any pipe with a given slice are in the same number. Note that pipes may split or collapse across the sample space, but this happens exactly at the border between slices and independently of θ. Namely, the piping layout depends on g_Θ, while the flow times along them of a single atom are ruled by θ. Moreover, we expect that under normal conditions (to be defined later) pipe bifurcations are simultaneously flown by atoms. A gooseneck like the one along the sequence of nodes (1 → 2 → 4 → 5) in Fig. 2.13 would violate the rule at least on the tail of the u's motion.

Except for caveats that will be exhaustively considered in a moment, the grossly absurd point is the following. Consider the last time θ an atom flows in a given slice. This means that after θ no other atom can flow in this slice; ergo no pipe crossing this slice can come back to it through the gooseneck.

Now, it is clear that Θ can have only one distribution law. Thus, if any sufficient statistic is a candidate for stating a twisting argument (under the regularity conditions we will formalize in a moment), we will focus on only one to draw the distribution law. This is not contradictory if we consider a unique statistic, let us call it minimal, that is a function of all sufficient statistics. Therefore any correct distribution law based on a sufficient statistic maps bijectively into the one based on the minimal sufficient statistic.


Example 2.5. Figure 2.14 shows the partition induced by the statistic h(X_1, ..., X_m) = Σ_{i=1}^m X_i on the sample set of a Bernoulli variable, when m = 3 and p = 0.7. In Fig. 2.14(a), the u space has been discretized in atoms corresponding to elementary hypercubes of edge δ = 0.1. Arrows highlight how some of these hypercubes move into the partition contour corresponding to h(x) = 3 if the distribution parameter is raised to p = 0.8. Figure 2.14(b) shows how the corresponding x sample space is partitioned by h.
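
The migration of atoms described in Example 2.5 can be sketched numerically. The fragment below is only an illustration under stated assumptions: it takes the Bernoulli explaining function to be g_p(u) = 1 if u ≤ p and 0 otherwise (the explaining function is defined outside this section, so this form is an assumption), it represents each atom by the center of its hypercube, and the helper names `g` and `h` are ours.

```python
from itertools import product
from collections import Counter

def g(p, u):
    # Assumed Bernoulli explaining function: seed u maps to 1 when u <= p.
    return 1 if u <= p else 0

def h(p, atom):
    # Statistic: sum of the sample generated by the seeds in `atom`.
    return sum(g(p, u) for u in atom)

delta = 0.1
# Centers of the elementary hypercubes of edge delta (0.05, 0.15, ..., 0.95).
centers = [delta * (k + 0.5) for k in range(10)]
atoms = list(product(centers, repeat=3))    # 1000 atoms in [0, 1]^3

before = Counter(h(0.7, a) for a in atoms)  # atoms per contour at p = 0.7
after = Counter(h(0.8, a) for a in atoms)   # atoms per contour at p = 0.8

# No atom moves to a contour with a smaller value of the statistic.
monotone = all(h(0.8, a) >= h(0.7, a) for a in atoms)
# Atoms reaching the contour h(x) = 3 from a lower contour.
arrivals = sum(1 for a in atoms if h(0.7, a) < 3 and h(0.8, a) == 3)

print(dict(before), dict(after), monotone, arrivals)
```

Under these assumptions the count of atoms in each contour changes with p, but every atom either stays in its contour or moves to one with a larger value of the statistic.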


In order to obtain a formal enunciation of the theorem ruling this matter we 
need a slight extension of the definition of sufficient statistic, which coincides 
with the classical one, and formalize regularity conditions as follows. 


Definition 2.3. For a parameter set Θ and a family of explaining functions g_Θ, let X be the output random variable for a given θ, f_X(·; θ) its d.f., and ℱ the family of density functions with θ ∈ Θ. For a sample X ∈ 𝔛^m, a statistic S : 𝔛^m ↦ Θ, whose contours induce on 𝔛^m the partition U(S), is said to be sufficient with reference to the parameter Θ if the ratio f_X(x¹; θ)/f_X(x²; θ) does not depend on θ when x¹ and x² belong to the same element of U(S).

A sufficient statistic S is said to be minimal if whenever x¹, x² ∈ 𝔛 are such that f_X(x¹; θ)/f_X(x²; θ) does not depend on θ, they belong to the same element of U(S).


[Margin notes: with a common flow amount through pipes crossing a same s contour. The last atom cannot flow through a gooseneck. One Θ distribution, one sufficient statistic: the minimal one. The most elementary flow. Sufficiency as insensitivity to θ within an s contour.]

Fig. 2.14: A three-dimensional uniform sample space and its mapping to the sample set of a Bernoulli variable of mean p = 0.7. (a) Small cubes: discretized elements of [0, 1]³; gray regions: contours of the partition induced on the Bernoulli sample set by the minimal sufficient statistic S = h(X_1, X_2, X_3) = X_1 + X_2 + X_3; arrows: shift of the discretized elements from one contour to another when the Bernoulli mean is raised to p = 0.8; the gray arrow denotes an atom's diagonal shift of disregardable measure. (b) Partition induced by h on the Bernoulli sample set, where each contour is associated with the gray shade used in (a).

Example 2.6. Consider a sample {X_1, ..., X_m} drawn from a Bernoulli distribution of mean p. Since

\[ f_X(x_1, \ldots, x_m) = \prod_{i=1}^{m} f_X(x_i) = \prod_{i: x_i = 1} p \prod_{i: x_i = 0} (1 - p) = p^{\sum_{i=1}^{m} x_i} (1 - p)^{m - \sum_{i=1}^{m} x_i} \tag{2.25} \]

if x¹ and x² are sample realizations such that Σ_{i=1}^m x_i¹ = Σ_{i=1}^m x_i², then f_X(x¹)/f_X(x²) = 1. This value being independent of p, the statistic S(X_1, ..., X_m) = Σ_{i=1}^m X_i is sufficient with reference to P. Moreover, it is easy to verify from (2.25) that sample realizations whose elements sum up to a given value have the same probability of being observed, while realizations with different sums have a probability ratio that does depend on p. Thus S is a minimal sufficient statistic.
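
The claim of Example 2.6 can be spot-checked by enumeration. The sketch below is illustrative only: it evaluates the joint probability (2.25) on all Bernoulli samples of size 3 for a few arbitrarily chosen values of p, which is a numerical check and not a proof; the function names are ours.

```python
from itertools import product

def f(x, p):
    # Joint probability of a Bernoulli sample x, as in (2.25).
    s = sum(x)
    return p ** s * (1 - p) ** (len(x) - s)

samples = list(product([0, 1], repeat=3))
ps = [0.2, 0.5, 0.7]   # arbitrary values of the parameter

def ratio_is_constant(x1, x2):
    # True when f(x1; p) / f(x2; p) takes the same value for every p tried.
    ratios = [f(x1, p) / f(x2, p) for p in ps]
    return max(ratios) - min(ratios) < 1e-12

# Within a contour of the sum the ratio does not move with p ...
same_contour_ok = all(ratio_is_constant(a, b)
                      for a in samples for b in samples if sum(a) == sum(b))
# ... while across contours it does, which is what minimality asks for.
cross_contour_varies = all(not ratio_is_constant(a, b)
                           for a in samples for b in samples if sum(a) != sum(b))
print(same_contour_ok, cross_contour_varies)
```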


A more direct way of characterizing sufficiency is the following. It is based on the idea of mapping 𝔛 into a new space 𝔜, possibly splitting a single specification of X into a set of elements Y in 𝔜. In such a way: i) the contours of the partition U(S) induced by S in the sample space D_X bijectively translate into contours of an analogous partition W(S) in D_Y, and ii) we may use the above splits of X specification to obtain replicas of X samples in 𝔜 in a number that is independent of θ and is suitable for equalizing the probabilities of the elements in a same contour of W(S).


Fig. 2.15: A suspect couple of (s, θ) trajectories.


Lemma 2.1. A statistic S on 𝔛 is sufficient w.r.t. Θ if and only if 𝔛 maps on a set 𝔜 such that S induces on D_Y a partition W(S) and contours of U(S) entirely map into contours of W(S). Moreover f_Y(y¹; θ) = f_Y(y²; θ) for any θ when y¹ and y² belong to the same element of W(S). S is minimal if whenever y¹, y² ∈ D_Y are such that f_Y(y¹; θ) = f_Y(y²; θ), they belong to the same element of W(S).


Proof. (Sketch) As f_X(x¹; θ)/f_X(x²; θ) = ρ for any θ, we just split x¹ and x² in ν_1 and ν_2 points in 𝔜 such that ν_1/ν_2 = ρ. □


We speak of this property as virtual equiprobability of X in a same contour 
of U(S), as mentioned in the sufficiency rule. 


Example 2.7. Consider a sample X = {X_1, ..., X_m} drawn from a binomial distribution of parameters n ∈ ℕ and p ∈ [0,1]. The statistic h(X_1, ..., X_m) = Σ_{i=1}^m X_i is sufficient with reference to P, as whenever x¹, x² ∈ 𝔛 = {0, ..., n}^m are such that h(x¹) = h(x²) it follows from simple computations that

\[ \frac{f_X(x^1)}{f_X(x^2)} = \frac{\prod_{i=1}^{m} \binom{n}{x_i^1}}{\prod_{i=1}^{m} \binom{n}{x_i^2}} \tag{2.26} \]

is independent of p. To operationally translate this property into virtual equiprobability, consider the space 𝔜 = {0,1}^{mn} and the mapping φ: 𝔛 ↦ 𝔜 associating to x ∈ 𝔛 a vector composed of m sub-vectors, each containing the binary expansion of the corresponding component of x (i.e. 0 is mapped into {0,0,...,0}, 1 is split into {1,0,0,...,0}, {0,1,0,...,0}, and so on). Coming back to the original sample realizations x¹ and x², denoting by y¹ and y² their transformations in 𝔜 through φ and recalling that Σ_{j=1}^{mn} y_j¹ = Σ_{j=1}^{mn} y_j², we can easily derive f_Y(y¹) = f_Y(y²) from the computations of Example 2.6.
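
The splitting of Example 2.7 can be made concrete on a small case. The sketch below is illustrative and rests on our own choices: n = 3, m = 2, p = 0.4 and the two samples are arbitrary, and `replicas` is a hypothetical helper that enumerates the binary vectors a sample is split into.

```python
from itertools import product
from math import comb, isclose

n, m, p = 3, 2, 0.4

def f_x(x):
    # Joint binomial probability of the sample x = (x_1, ..., x_m).
    out = 1.0
    for xi in x:
        out *= comb(n, xi) * p ** xi * (1 - p) ** (n - xi)
    return out

def replicas(x):
    # All binary vectors of length m*n whose i-th block of n bits sums to x_i.
    blocks = [[b for b in product([0, 1], repeat=n) if sum(b) == xi] for xi in x]
    return [sum(choice, ()) for choice in product(*blocks)]

def f_y(y):
    # Probability of one binary vector y, read as a Bernoulli sample of size m*n.
    s = sum(y)
    return p ** s * (1 - p) ** (len(y) - s)

x1, x2 = (3, 0), (1, 2)                  # same contour: both sum to 3
r1, r2 = replicas(x1), replicas(x2)

# Every replica in the contour has the same probability ...
equiprobable = all(isclose(f_y(y), f_y(r1[0])) for y in r1 + r2)
# ... and the number of replicas accounts for the original probability.
recovered = (isclose(f_x(x1), len(r1) * f_y(r1[0]))
             and isclose(f_x(x2), len(r2) * f_y(r2[0])))
print(len(r1), len(r2), equiprobable, recovered)
```

The numbers of replicas are the products of binomial coefficients appearing in (2.26), so their ratio does not depend on p.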


[Margin notes: i.e. virtual equiprobability within an s contour. An order relationship in the joint space of both statistics and parameters. Regularity in terms of: computability, discretizability, rankability. Flow of atomic probabilities for capturing twisting arguments.]


The additional value of statistical sufficiency may be appreciated when we aim to state a twisting argument. In this case indeed, we want not only to reverse probabilities of random seeds into probabilities of parameter specifications (the true goal of bootstrap methods) but also to link an order relation existing on the parameter with an analogous relation on the statistic. This occurs, for instance, when the (s, θ) trajectories do not cross for (almost) every pair of seeds, as in Figs. 2.5 and 2.7(a,c); a crossing such as the one illustrated in Fig. 2.15 breaks the link and, still worse, so do pairs of seeds where s increases with θ on one seed and decreases on the other. Properties like these are not always attainable. Thus, to make our job easier, we come to consider regular families of distributions defined as follows.


Definition 2.4. A family 𝒳 of random variables is regular if its explaining function g_θ is such that:

(a) g_θ is computable for each θ ∈ D_Θ;

(b) the related family of density functions ℱ is separable, that is a dense family in it exists¹³. We obtain an element of this family by considering a partition U(Θ) and replacing g_θ by a discrete function γ_θ that: i) is parametrized in the index κ of this partition, ii) has as input a uniform discretization Δ_U of U, whose elements we will call probability atoms, and iii) has as output a variable ranging over an enumerable set Δ_X ⊆ D_X for any κ ∈ U(Θ);

(c) there exists a weak order relation ⪯ on 𝔛 such that for each θ_1, θ_2 ∈ D_Θ with θ_1 ≤ θ_2 we have g_θ1(u) ⪯ g_θ2(u) for each u ∈ [0,1]. U moves from the minimum to the maximum of X according to this order relation with θ moving from its minimum to maximum. Namely:

\[ \lim_{\theta \to \theta_{\inf}} g_\theta(u) = \inf\nolimits_{\preceq} \{x \in \mathfrak{X}\} \quad \forall u \in [0,1] \tag{2.27} \]

\[ \lim_{\theta \to \theta_{\sup}} g_\theta(u) = \sup\nolimits_{\preceq} \{x \in \mathfrak{X}\} \quad \forall u \in [0,1] \tag{2.28} \]

where “∀” is the universal quantifier (“for each”), θ_inf and θ_sup represent the infimum and the supremum values of the unknown parameter when it ranges within D_Θ (possibly both infinite), while inf_⪯ and sup_⪯ denote the infimum and supremum operators based on the order relation ⪯.


Henceforth we will use γ_θ for deploying our basic reasonings in a simple way and transfer their conclusions to g_θ whenever it is correct. Or rather, we will nominally refer to g_θ though formally considering γ_θ. Thus we will speak of atomic input u to g_θ, meaning a point of coordinates u inside a probability atom that concentrates on it the atom's probability mass. Analogously we will identify θ with the index of the element of the U(Θ) partition containing θ, and likewise for the increments δθ, which are discrete according to U(Θ). In this framework the class of statistics of interest to us is characterized as follows:

¹³ Density property means the possibility of selecting, for any arbitrary accuracy, a countable subfamily ℱ_0 ⊆ ℱ whose elements can approximate every function in ℱ with this accuracy [Rudin, 1974].


Definition 2.5. Given a real number r > 0, a statistic S is an r-bounded increment (r-bi) statistic if for each increment δθ small enough and for each θ ∈ D_Θ, almost every probability atom¹⁴ of coordinates u = (u_1, ..., u_m) moves from the partition element corresponding to S(g_θ(u_1), ..., g_θ(u_m)) to the one corresponding to S(g_{θ+δθ}(u_1), ..., g_{θ+δθ}(u_m)) such that |S(g_{θ+δθ}(u_1), ..., g_{θ+δθ}(u_m)) − S(g_θ(u_1), ..., g_θ(u_m))| ≤ r ¹⁵.


The mentioned statistics are the pivot of almost any twisting argument, 
according to the following theorem. 


Theorem 2.1. Let 𝒳 be a regular family of random variables generated by a family g_Θ of explaining functions, x_m = {x_1, ..., x_m} a sample drawn from a variable within 𝒳 for a given θ ∈ D_Θ, and S = h(X_m) a statistic.

The twisting argument: for almost every u ∈ [0,1]^m such that g_θ(u) = (g_θ(u_1), ..., g_θ(u_m)) = (x_1, ..., x_m) and h(x_1, ..., x_m) = s

\[ \left( h(g_{\tilde{\theta}}(u)) \ge s \right) \Leftarrow (\theta \le \tilde{\theta}) \Leftarrow \left( h(g_{\tilde{\theta}}(u)) \ge s + l \right) \tag{2.29} \]

for bounded l, can be stated

• if h is a function of an r-bi sufficient statistic for bounded r;

• only if the distribution law of Θ is a function of an r-bi minimal sufficient statistic specification for bounded r.


Proof. The claim of the theorem refers to a sufficient and a necessary condition, 
respectively, for the truth of two statements that are different though related. 
The former (IF condition) concerns properties of a statistic that allow it to 
constitute the pivot of a twisting argument for a parameter, hence a statistic 
on which the parameter distribution law will depend as well. The latter (ONLY IF condition) states an invariance property of this law: whatever pivot statistic is used, the distribution law will turn out to be a function, through it, of a minimal sufficient statistic.


IF 


Equation (2.29) realizes a monotone mapping from D_Θ to the range of S, namely for every u

\[ \theta_1 \ge \theta_2 \Rightarrow h(g_{\theta_1}(u)) \gtrsim h(g_{\theta_2}(u)) \tag{2.30} \]

where a ≳ b means a ≥ b + r, where r has a bounded absolute value. Concerning the migration of the atoms u, the regularity property of the considered family states that, when the value of θ increases, all of them move with θ from the minimum to the maximum value of y inside a D_Y related to D_X as in Lemma 2.1 under the partial order relation: y′ ≺ y″ if and only if y′_i ⪯ y″_i for each i and y′_i ≺ y″_i for at least one i¹⁶. The trajectory of each u inside D_Y produces a path in a directed acyclic graph G = (V, E) as follows.

¹⁴ That is, all atoms except for a few summing up to a disregardable probability.

¹⁵ This is a probabilistic version of the Lipschitzianity property of S w.r.t. θ. A function f: X ↦ ℝ is Lipschitzian in A ⊆ ℝ if a positive constant M exists such that |f(x) − f(y)| ≤ M|x − y| ∀x, y ∈ A [Rudin, 1974].

[Margin notes: A statistic with limited jumps with the parameter. The mother theorem. Conditions on the parameters based on sufficient statistics. A monotone mapping drives the inference. A graph for flowing probabilities. No gooseneck if the statistic is sufficient, just to maintain equiprobability in the statistic contours,]


Definition 2.6. The path's graph G = (V, E) is a directed acyclic graph, where a vertex v ∈ V represents a sample y and an edge e ∈ E from v_1 to v_2 exists only if ∃ u, θ, dθ such that


1. g_θ(u) = y_1 lies in v_1,


2. g_{θ+dθ}(u) = y_2 lies in v_2 and


3. if y_1 ≠ y_2, ∄ u′, θ′ such that g_{θ′+dθ}(u′) = y′ with y_1 ≺ y′ ≺ y_2.


The last condition means that the discretization operated on D_Θ is sharp enough to ensure that an atom cannot jump a node; rather, a dθ may push the atom ahead or not. The graph is fixed and determined by the functional form of the explaining function. For a fixed θ each node v contains those atoms u such that g_θ(u) corresponds to the sample represented by v.
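
A path's graph in the sense of Definition 2.6 can be built explicitly in a toy case. The sketch below is a hedged illustration, not the construction used in the text: it again assumes the Bernoulli explaining function g_p(u) = 1 if u ≤ p and 0 otherwise, takes m = 3, discretizes the parameter in steps of 0.1, and keeps only non-diagonal atoms (seeds with pairwise distinct coordinates).

```python
from itertools import permutations

def g(p, u):
    # Assumed Bernoulli explaining function: seed u maps to 1 when u <= p.
    return 1 if u <= p else 0

centers = [0.05 + 0.1 * k for k in range(10)]
# Non-diagonal atoms: seeds with pairwise distinct coordinates.
atoms = list(permutations(centers, 3))
thetas = [0.1 * k for k in range(11)]      # discretized parameter values

edges = set()
for u in atoms:
    # Sequence of vertices (samples) visited by the atom as the parameter grows.
    path = [tuple(g(p, ui) for ui in u) for p in thetas]
    for y1, y2 in zip(path, path[1:]):
        if y1 != y2:
            edges.add((y1, y2))

# Every edge flips exactly one coordinate from 0 to 1, so it raises the
# statistic sum(y) by one and the graph cannot contain cycles.
forward = all(sum(y2) - sum(y1) == 1 for y1, y2 in edges)
print(len(edges), forward)
```

With these choices all edges point in the increasing direction of the sum, which is the behaviour the subsequent argument requires of a sufficient statistic.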


Consider a sufficient statistic S and the partition U(S) induced by its contours 
on the sample space. We show that if we unfold the path’s graph in this parti- 
tion, all its edges are oriented in the statistic’s increasing direction — with the 
exception of possible contour permutations — otherwise the sufficiency does not 
hold. 

Suppose ab absurdo that there exists a sample y like vertex 2 in Fig. 2.13 with one of its outgoing edges moving forward to vertex 3 and the other backward to vertex 4. It is really back oriented in the sense that it ends in a contour already visited by y passing through vertex 1 preceding 2. Consider the time θ* ¹⁷ at which a set of atoms {u_i} reaches vertex 1. Thanks to the regularity conditions, the thresholds τ_i(θ) of the algorithm encoded in (1.71) all go close to 1 with increasing θ. Moreover, according to the third condition in Definition 2.6, neighboring vertices in a path differ by only one element of the sample that is updated to the next value allowed by the discretization of X, except for diagonal atoms (i.e. atoms having equal values of some coordinates that sum up to a disregardable probability), such as the one highlighted by a gray arrow in Fig. 2.14(a). Thus, assuming τ_i(θ) continuous in θ, if we are considering the last crossing time of vertex 1, the thresholds τ_i(θ + dθ) are so close that we find in 2 both atoms flowing in 3 and those flowing in 4. This means that at time θ + 2dθ an atom stays necessarily in 4, which is a contradiction since the last atom in 1 occurred at time θ. Therefore no atom stays in 1 at time θ + 2dθ and this violates the sufficiency condition (requiring the virtual equiprobability of the samples in a contour). Escaping the paradigmatic case in Fig. 2.13 may occur in three ways:

¹⁶ Where ≺ denotes the strong order relation obtained from the relation ⪯ introduced in point (c) of Definition 2.4: a ≺ b if and only if a ⪯ b and a ≠ b.

¹⁷ As mentioned before, we measure the time directly through the values of Θ, given the monotonicity between the two parameters.


- The backward edge reaches a contour never crossed before by the involved atoms. This means that a permutation in the contour ordering makes all edges forward-oriented.


- The backward edge jumps more than one contour, thus forcing a contour before the one containing vertex 1 to have a probability greater than zero. This is still a contradiction if the statistic has bounded increments and the sequence of backward edges is long. Indeed, if the backward atom u lands in a contour much farther than the one containing vertex 1, then the time of the last passage in this section will also come much earlier than that of 1, and will thus be incompatible with the u landing time at the end of the backward trip. In these cases the twisting is made considering a latency zone around the passage times in order to absorb the short-range goosenecks of the atom trajectories, accounted for by the bound l in (2.29).


- Thresholds τ_i are discontinuous functions of θ. This requires a jump on S in the rightmost event of the implication chain (2.29) to imply a jump on θ, still recovered by l.


ONLY IF 


Suppose that (2.29) holds. It means that an r-bi sufficient statistic exists for Θ, actually the sample itself, for pivoting the twisting argument. Any r′-bi sufficient statistic, with r′ limited as well, can do the same, leading in any case to the unique Θ distribution law. Then this law must be a function of any r-bi sufficient statistic, and therefore a function of the minimal r*-bi sufficient statistic, by definition with r* = max{r}, as an r-bi sufficient statistic is r*-bi sufficient as well. This statistic is unique, apart from bijective mappings on its values. Indeed, for any r-bi sufficient statistic S the minimal one S* is obtained by associating, through an η(s), the same value to all contours having the same value of S. Suppose that another S′* exists with the same property. Then its contours must cross the contours of S*; otherwise we have found an η′(s) that, as a function of η(s), induces a more minimal statistic. This means that the crossing points have the same course of virtual probability density with θ in respect to the other points in the same contour of S′*, and the same holds in comparison to other points falling into the contours of S* (see Fig. 2.16). Then neither S* nor S′* is minimal. □


[Margin notes: apart contours permutations, apart small latency zones, apart discontinuities in the sampling mechanism. Possibly many sufficient statistics, but unique distribution law of Θ, ergo a function of a minimal sufficient statistic. The most relevant sample properties are those representing sufficient statistics. Not always the sample mean is a relevant statistic.]

Fig. 2.16: Contours of the sample set partition determined by two statistics S* and S′*, delimited respectively by gray and black lines. Neither of them can be minimal, since samples in the gray region always have the same probability.

An operational consequence of the above theorem is that we must take great care in selecting the “appearance” on which to base our decisions. Having observed a sample we can state many properties on it: all of them are correct if they are correctly computed; some of them are not useful since they cannot be used to state links with properties on the population, ergo they prove misleading if improperly used for this task. Some properties are interesting, allowing a twisting argument. Sufficient statistics let us state a twisting argument useful for identifying the distribution law of the unknown parameter. The minimal sufficient statistic is exactly the property of the sample used to fix the distribution law of this parameter¹⁸. Thus sufficiency is a strong divide between good and poor statistics.


Example 2.8. Consider the family 𝒳 of the uniform distributions in [0, θ], with θ ∈ [0, +∞). The monotonicity of the explaining function with θ (namely, g_θ(u) = θu) gives rise to the uniformly oriented trajectories of the probability atoms as shown in Fig. 2.17. These trajectories cross the same slice contour of the statistic (X_1 − h)² + (X_2 − h)² twice, thus giving rise to loops that forbid the sufficiency of this statistic. Moreover, they fill the slices of the statistic X_1 + X_2 only partially. So on the same contour line we get both points (x_1, x_2) falling inside the square delimiting the domain where the joint density is different from 0 and points falling outside, thus having probability density 0. None of these inconveniences occurs with the statistic max{X_1, X_2}, which is actually sufficient for the parameter (see Example 2.11).
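
The two inconveniences of Example 2.8 can be reproduced on a single atom. The following is a minimal sketch under our own choices: the atom u = (0.2, 0.2), the grid of θ values between 0.1 and 0.2, the center 0.03 of the circular statistic (taken from the caption of Fig. 2.17) and the two test points are illustrative, and the helper names are hypothetical.

```python
from math import isclose

def g(theta, u):
    # Explaining function of the uniform distribution in [0, theta].
    return theta * u

def trajectory(stat, u, thetas):
    # Values taken by the statistic along the path of the atom u.
    return [stat([g(t, ui) for ui in u]) for t in thetas]

def is_monotone(values):
    return all(b >= a for a, b in zip(values, values[1:]))

def density(x, theta):
    # Joint density of a sample x from the uniform distribution in [0, theta].
    return theta ** -len(x) if all(0 <= xi <= theta for xi in x) else 0.0

thetas = [0.10 + 0.01 * k for k in range(11)]      # theta from 0.1 to 0.2
u = (0.2, 0.2)                                     # one illustrative atom

circle = lambda x: (x[0] - 0.03) ** 2 + (x[1] - 0.03) ** 2
total = lambda x: x[0] + x[1]
largest = lambda x: max(x)

# First drawback: the circular statistic first decreases, then increases.
print(is_monotone(trajectory(circle, u, thetas)))
print(is_monotone(trajectory(total, u, thetas)))
print(is_monotone(trajectory(largest, u, thetas)))

# Second drawback: two points on the same contour of the sum, one inside and
# one outside the support for theta = 0.1.
inside, outside = (0.09, 0.03), (0.12, 0.0)
print(isclose(total(inside), total(outside)),
      density(inside, 0.1), density(outside, 0.1))
```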


Remark 2.3. Note that, of the two drawbacks mentioned in regard to the use of the statistic X_1 + X_2 for capturing the extreme of a uniform distribution, only the second one renders the statistic not well behaving (see Sec. 2.1.3). Goosenecks indeed do not disturb the population bootstrapping.

¹⁸ An obvious corollary is that if a parameter admits a sufficient statistic, then it admits a minimal sufficient statistic as well [Fisher, 1925].

Fig. 2.17: Atomic probability flow across different partitions induced by statistics for θ moving from 0.1 to 0.2, driven by a sample of the uniform distribution in [0, θ]. Straight line: a contour of the statistic h(X_1, X_2) = X_1 + X_2; circle: a contour of the statistic h(X_1, X_2) = (X_1 − 0.03)² + (X_2 − 0.03)²; corner: a contour of the statistic h(X_1, X_2) = max{X_1, X_2}. Gray and white big squares denote the sample set for θ = 0.1 and θ = 0.2, respectively. Small and large squares identify probability atoms for θ = 0.1 and θ = 0.2, respectively, arrows their motion.


Not all regular distribution laws admit suitable sufficient statistics. Theorem 2.1 however does not exclude the possibility of actually basing a twisting argument on non minimal sufficient statistics. This may happen, as in the Example below, when the sole sufficient statistics for a given parameter are trivially the sample itself and the order statistics¹⁹ (see Fact 2.2 later on), where the latter represent minimal sufficient statistics. In this case we may have a more synthetic statistic that: i) as a symmetric function of the sample is formally a function of the minimal sufficient statistic, but ii) in itself is neither a sufficient statistic nor a function of non trivial sufficient statistics.


[Margin notes: Sometimes just smart statistics. Only some parameters admit non trivial sufficient statistics. A broad pivot for a broad method.]

Fig. 2.18: Empirical c.d.f.s of the sample median from a symmetric exponential variable with different parameters λ. τ*: observed statistic; dashed lines: confidence thresholds around which to pivot Λ quantiles.

Example 2.9. Consider the symmetric exponential distribution introduced in Example 2.3. The joint density of a generic sample {X_1, ..., X_m} is

\[ f_X(x_1, \ldots, x_m; \lambda) = \frac{1}{2^m} e^{-\sum_{i=1}^{m} |x_i - \lambda|} \tag{2.31} \]

Generally the ratio f_X(x¹; λ)/f_X(x²; λ) depends on λ, apart from trivial cases such as x¹ being equal to x² or to a permutation of its components. Therefore we conjecture that no sufficient statistic with reference to Λ exists, with the exception of the mentioned trivial ones (observed sample or order statistics). However, in Example 2.3 we noted that the median τ of the sampled values, i.e. the central value of its ordered issue, suggested by Fisher [Fisher, 1935], may be suitably used for computing the Λ distribution law.

¹⁹ An order statistic is constituted by the list of the elements of x sorted in respect to an order relation, for instance for ascending values. The contour of this statistic is represented by all the permutations of a given sample specification.

Moreover, from (2.12) we see that a monotone relation exists between parameter λ and statistic τ for any set of random seeds {u_1, ..., u_m}, so that we may state the twisting argument

\[ (\tau \le \tau_{\tilde{\lambda}}) \Leftrightarrow (\lambda \le \tilde{\lambda}) \tag{2.32} \]

The sole drawback is that we do not have an analytical form of the T distribution law (T being the random variable whose specifications are the τ's).

We could however overcome the inconvenience with a direct method. Namely, on the basis of large samples of T generated by plugging random seeds in (2.12) we compute the empirical cumulative distribution of T for different values of λ as in Fig. 2.18 and discover that there is a monotone shift of the curves with λ. Thus, in order to find the quantiles of Λ a possible procedure is the following:


1. Fix the observed τ* on the abscissa axis and draw a vertical line through it;

2. fix the values δ/2 and 1 − δ/2 on the ordinate axis and draw a horizontal line through each point;

3. pick the left extreme λ_i of the confidence interval at the cross of the vertical line with the latter horizontal one (the c.d.f. value at τ* decreases as λ grows, so the curve through the higher ordinate carries the smaller parameter) by setting λ_i equal to the parameter of the curve passing through this point;

4. pick the right extreme λ_s of the confidence interval at the cross of the vertical line with the former horizontal one by setting λ_s equal to the parameter of the curve passing through this point.




Of course, in this case bootstrapping populations as in Example 2.3 proves more 
efficient. 
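
The graphical procedure for the Λ quantiles can be sketched in code as follows. This is an illustration under explicit assumptions, not the authors' implementation: since (2.12) lies outside this section, the explaining function `g` below is assumed to be the inverse c.d.f. of the symmetric exponential law with location λ and unit scale; the sample size, the grid of λ values, the number of replicas and the observed τ* = 0 are arbitrary choices. Because the c.d.f. value of T at τ* decreases as λ grows, the curve through (τ*, 1 − δ/2) carries the lower extreme and the one through (τ*, δ/2) the upper extreme.

```python
import math
import random
from statistics import median

def g(lam, u):
    # Assumed explaining function: inverse c.d.f. of the symmetric
    # exponential (Laplace) law with location lam and unit scale.
    return lam - math.copysign(1.0, u - 0.5) * math.log(1 - 2 * abs(u - 0.5))

def ecdf_at(tau_star, lam, seeds):
    # Empirical c.d.f. of the sample median T at tau_star for parameter lam.
    medians = [median(g(lam, u) for u in row) for row in seeds]
    return sum(t <= tau_star for t in medians) / len(medians)

def confidence_interval(tau_star, m, delta, grid, replicas=1000, seed=0):
    rng = random.Random(seed)
    # Seeds kept strictly inside (0, 1) so that the logarithm is defined.
    seeds = [[1e-12 + (1 - 2e-12) * rng.random() for _ in range(m)]
             for _ in range(replicas)]
    cdf = {lam: ecdf_at(tau_star, lam, seeds) for lam in grid}
    # Lower extreme: largest lam whose curve is still above 1 - delta/2.
    lower = max(lam for lam in grid if cdf[lam] >= 1 - delta / 2)
    # Upper extreme: smallest lam whose curve is already below delta/2.
    upper = min(lam for lam in grid if cdf[lam] <= delta / 2)
    return lower, upper

grid = [-3 + 0.1 * k for k in range(61)]
lam_i, lam_s = confidence_interval(tau_star=0.0, m=11, delta=0.1, grid=grid)
print(lam_i, lam_s)
```

The same seeds are reused for every λ, so the empirical curves shift monotonically with the parameter and the two extremes are well defined on the grid.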


Still worse, a sample {X_1, ..., X_m} drawn from a binomial distribution of parameters p ∈ [0,1] and n ∈ ℕ has probability function

\[ f_X(x_1, \ldots, x_m; n, p) = p^{\sum_{i=1}^{m} x_i} (1 - p)^{mn - \sum_{i=1}^{m} x_i} \prod_{i=1}^{m} \binom{n}{x_i} \tag{2.33} \]

so that no sufficient statistic emerges for n from the ratio f_X(x¹; n, p)/f_X(x²; n, p). In addition, no smart statistics appear available as in the previous case.


Remark 2.4. In conclusion, it is not true that a twisting argument holds if and 
only if its pivot is a minimal sufficient statistic. However this is true for the 
standard parameters of many distribution laws we know as they belong to the 
class of either exponential or of bounded measure domain distributions whose 
properties will be shown in a moment. 


2.1.6.1 Facts of life of a sufficient statistic 


Sufficiency is a property specifically concerning the sampling mechanism of the 
observed points, hence their density or probability function. The following fact 
will clarify this aspect, at the same time suggesting an operational way for 
finding (possibly minimal) sufficient statistics. 


[Margin note: Features of a sufficient statistic.]

Fact 2.1. For any family 𝒳 of random variables generated by a family g_Θ of explaining functions the following statements are equivalent:

• a statistic S is sufficient w.r.t. Θ;

• a statistic S induces a partition U(S) on D_X such that the ratio f_X(x¹; θ)/f_X(x²; θ) does not depend on θ when x¹ and x² belong to the same element of U(S);

• a statistic S on 𝔛 induces a partition W(S) on D_Y, where D_Y is the support of a family 𝒴 of random variables into which 𝒳 maps, such that: i) contours of U(S) entirely map into contours of W(S), and ii) f_Y(y¹; θ) = f_Y(y²; θ) for any θ when y¹ and y² belong to the same element of W(S);

• a statistic S is such that the joint density/probability function f_{X|S=s} of a sample given the value of the statistic is independent of θ;

• a statistic S is such that the conditional distribution f_{T|S=s} of T given the value of S is independent of θ for every statistic T;

• a statistic S is such that the joint density function L(x_1, ..., x_m; θ) of a sample x = {x_1, ..., x_m}, denoted likelihood of the sample elsewhere, can split in the product of a function independent of θ and another one dependent on x only through s (factorization criterion) [Neyman, 1935]. Namely:

\[ L(x_1, \ldots, x_m; \theta) = \prod_{i=1}^{m} f_X(x_i; \theta) = \eta(x_1, \ldots, x_m) \, \gamma(s, \theta) \tag{2.34} \]
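
The fourth statement of Fact 2.1 can be checked by brute force in the Bernoulli case. The sketch below is illustrative only; the sample size, the sample x and the values of p are arbitrary, and `conditional` is a hypothetical helper computing the probability of a sample given the value of the sum.

```python
from itertools import product
from math import comb, isclose

def f(x, p):
    # Joint probability of a Bernoulli sample x with mean p.
    s = sum(x)
    return p ** s * (1 - p) ** (len(x) - s)

def conditional(x, p, m):
    # P(X = x | S = sum(x)): joint probability over the total mass of the contour.
    s = sum(x)
    contour = [y for y in product([0, 1], repeat=m) if sum(y) == s]
    return f(x, p) / sum(f(y, p) for y in contour)

m, x = 4, (1, 0, 1, 0)
values = [conditional(x, p, m) for p in (0.1, 0.5, 0.9)]
print(values)   # each value is close to 1 / comb(4, 2), whatever p
```

In terms of (2.34), this corresponds to η being constant and γ(s, p) = p^s (1 − p)^{m − s}.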


Remark 2.5. The fourth statement of Fact 2.1 looks like an extension to any S specification of property 3 in Definition 2.2. Analogously, the second statement implies property 2 in the same definition. The monotonicity requested by property 1 has been proven in Theorem 2.1, at least for regular distributions and within the limits of r-bounded statistics. Thus if a statistic is sufficient w.r.t. a parameter Θ, then it is well behaving as well, within the above conditions. On the contrary, statistic sufficiency is not a strictly necessary request for devising a p-bootstrap procedure or for pivoting a twisting argument.


Thus, starting from an explaining function g_θ and aiming to discover a key way for observing a sample, we must realize the local sensitivity of the density function to θ via γ in (2.34). Then for computing the parameter cumulative distribution we can come back to g_θ, rather than to the inverse of this function. We may start from the very simple sufficient statistic constituted exactly by the values of the observed sample²⁰ and go ahead looking for its minimality. As for the random variables considered in Appendix B, even if we are looking for standard functions of these variables, the following two statements almost exhaust the search.


Fact 2.2. For any sample $\{X_1,\dots,X_m\}$ and any parameter Θ the specification
$\{x_1,\dots,x_m\}$ is a joint sufficient statistic. The sorted set $\{x_{(1)},\dots,x_{(m)}\}$ is a
sufficient statistic as well.


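Proof. Indeed $P_{X_1,\dots,X_m}(x_1,\dots,x_m \mid X_1 = x_1,\dots,X_m = x_m) = 1$ and
$P_{X_1,\dots,X_m}(x_1,\dots,x_m \mid X_{(1)} = x_{(1)},\dots,X_{(m)} = x_{(m)}) = 1/\tilde m!$, where $\tilde m = m$
when all $x_i$s are different from each other, and a proper number less than $m$
otherwise. In both cases the conditional probability does not depend on $\theta$. □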


Standard
statistics


Lemma 2.2. [Dynkin, 1951] For any family of probability distributions whose
density functions can read as


$$f(x;\theta) = \exp\big(c(\theta)\,d(x) + a(x) + b(\theta)\big) \tag{2.35}$$


where a, b, c and d are real-valued functions, the statistic $\sum_{i=1}^{m} d(X_i)$ is a
minimal sufficient statistic for Θ.


20 We speak of joint sufficient statistics when the statistic is a vector of statistics.




We omit the proof and observe instead that with such $f_\theta$ the function $\gamma(s,\theta)$
in (2.34) is expressed as $\exp\big(m\,b(\theta) + c(\theta)\sum_{i=1}^{m} d(x_i)\big)$. Thanks to its relevance
w.r.t. the identification of sufficient statistics, we gather in the exponential
family all the density/probability functions having the form (2.35). Many
distributions studied in Appendix B, such as the Bernoulli or the negative exponential
ones, belong to this family. Others, like the binomial of parameters n and p, belong
to this family in regard to only one parameter, specifically the latter.


Example 2.10. Since the probability density of a negative exponential
distribution with parameter λ can be expressed as


$$f(x;\lambda) = \lambda e^{-\lambda x} = \exp(-\lambda x + \ln\lambda) \tag{2.36}$$


the related family of distributions is included in the exponential family for
$a(x) = 0$, $b(\lambda) = \ln\lambda$, $c(\lambda) = -\lambda$, and $d(x) = x$.

Given a binomial distribution of parameters $n \in \mathbb{N}$ and $p \in [0,1]$, consider
the former fixed. Then the probability function assumes the form


$$f(x;p) = \binom{n}{x} p^x (1-p)^{n-x} = \exp\left(\ln\binom{n}{x} + x\ln\frac{p}{1-p} + n\ln(1-p)\right) \tag{2.37}$$

so that the family of binomial distributions with a fixed value for n belongs to
the exponential family for $a(x) = \ln\binom{n}{x}$, $b(p) = n\ln(1-p)$, $c(p) = \ln\frac{p}{1-p}$ and
$d(x) = x$.

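Finally consider the following two families of Gaussian distributions:


• if the standard deviation σ is fixed, the probability density assumes the
form


$$f(x;\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \exp\left(-\ln\big(\sqrt{2\pi}\,\sigma\big) - \frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right) \tag{2.38}$$


and then the related family of distributions belongs to the exponential
family for $a(x) = -\ln\big(\sqrt{2\pi}\,\sigma\big) - \frac{x^2}{2\sigma^2}$, $b(\mu) = -\frac{\mu^2}{2\sigma^2}$, $c(\mu) = \mu$, and
$d(x) = \frac{x}{\sigma^2}$;


• if the mean μ is fixed, the probability density reads

$$f(x;\sigma) = \exp\left(-\ln\big(\sqrt{2\pi}\,\sigma\big) - \frac{(x-\mu)^2}{2\sigma^2}\right) \tag{2.39}$$


so that the related family of distributions belongs to the exponential family
for $a(x) = 0$, $b(\sigma) = -\ln\big(\sqrt{2\pi}\,\sigma\big)$, $c(\sigma) = -\frac{1}{2\sigma^2}$, and $d(x) = (x-\mu)^2$.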
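
The decompositions of Example 2.10 can be cross-checked numerically. The following is only a minimal sketch: the helper names (`exp_family_density`, `neg_exponential_density`, `binomial_probability`) are hypothetical and introduced here for illustration, and the test points are arbitrary. It evaluates the exponential-family form (2.35) with the functions a, b, c, d listed in the example and compares the result with the usual closed-form expressions.

```python
import math

def exp_family_density(x, theta, a, b, c, d):
    """Evaluate exp(c(theta) * d(x) + a(x) + b(theta)), the form in (2.35)."""
    return math.exp(c(theta) * d(x) + a(x) + b(theta))

def neg_exponential_density(x, lam):
    """Negative exponential density through the decomposition of (2.36)."""
    return exp_family_density(
        x, lam,
        a=lambda x: 0.0,
        b=lambda lam: math.log(lam),
        c=lambda lam: -lam,
        d=lambda x: x,
    )

def binomial_probability(x, p, n):
    """Binomial probability (n fixed) through the decomposition of (2.37)."""
    return exp_family_density(
        x, p,
        a=lambda x: math.log(math.comb(n, x)),
        b=lambda p: n * math.log(1.0 - p),
        c=lambda p: math.log(p / (1.0 - p)),
        d=lambda x: x,
    )

# Direct formulas, used only as a cross-check at arbitrary points.
direct_exp = 2.0 * math.exp(-2.0 * 1.5)
direct_bin = math.comb(10, 3) * 0.3**3 * 0.7**7
gap_exp = abs(neg_exponential_density(1.5, 2.0) - direct_exp)
gap_bin = abs(binomial_probability(3, 0.3, 10) - direct_bin)
```

Both gaps should be at the level of floating-point rounding, which is what we expect if the four functions have been read off correctly.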


The exponential 
family 


Exponential 
family includes a 


lot of
distributions 


Otherwise sample 


extremal values 
may be sufficient 
statistics 


Standard 
sufficient
statistics again 


Many parameters 
need many 
equations 




The Gaussian distribution law falls in the exponential family even in an
extension we will see shortly, considering both μ and σ free, while the hypergeometric
and uniform distributions do not belong at all to this family. The latter
distribution, as a member of the bounded measure domain distribution family, can
be managed through the following Lemma.


Lemma 2.3. [Kendall and Stuart, 1961] Given a sample $\{X_1,\dots,X_m\}$ with
explaining function $g_\theta$ such that $D_X$ is the interval $[a,\theta]$ for $a \in \mathbb{R}$, we can
compute a sufficient statistic from it if and only if a factorization of the
density/probability function of the random variable $X_{(m)} = \max_{i=1,\dots,m}(X_i)$ exists
such that

$$\frac{f_{X_\theta}(x_{(m)})}{F_{X_\theta}(x_{(m)})} \text{ is independent of } \theta \tag{2.40}$$

If this statement is true, then a minimal sufficient statistic for Θ is a function
of $X_{(m)}$, where Θ is denoted truncation parameter.
Analogously for $D_X$ coinciding with the interval $[\theta,b]$ for fixed $b \in \mathbb{R}$, if

$$\frac{f_{X_\theta}(x_{(1)})}{1 - F_{X_\theta}(x_{(1)})} \text{ is independent of } \theta \tag{2.41}$$

then a minimal sufficient statistic for Θ is a function of $X_{(1)} = \min_{i=1,\dots,m}(X_i)$.
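Example 2.11. Consider the uniform distribution in $[0,\theta]$, with $\theta \in \mathbb{R}^+$. Being
$f_{X_\theta}(x) = \frac{1}{\theta} I_{[0,\theta]}(x)$ and

$$F_{X_\theta}(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{x}{\theta} & \text{if } 0 \le x \le \theta \\ 1 & \text{if } x > \theta \end{cases} \tag{2.42}$$

the ratio $f_{X_\theta}(x)/F_{X_\theta}(x)$ amounts to $\frac{1}{x}$ whenever $0 < x < \theta$. Thus $X_{(m)}$ is a
minimal sufficient statistic for Θ, as numerically confirmed by the experiment
in Fig. 2.17.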
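
A quick way to see why the sample maximum carries all the information about θ in Example 2.11 is to evaluate the likelihood of a uniform sample on $[0,\theta]$, which equals $\theta^{-m}$ when all points lie in the interval and 0 otherwise. The sketch below is an illustration under that standard formula, not the experiment of Fig. 2.17; the function name and the two samples are hypothetical.

```python
def uniform_likelihood(sample, theta):
    """Likelihood of a sample from the uniform distribution on [0, theta]."""
    if min(sample) < 0 or max(sample) > theta:
        return 0.0
    return theta ** (-len(sample))

# Two different samples sharing the same size and the same maximum (3.1).
sample_1 = [0.2, 1.7, 0.9, 3.1]
sample_2 = [3.1, 2.5, 0.1, 1.0]
thetas = [2.0, 3.1, 4.0, 10.0]

# The likelihood is the same function of theta for both samples.
same_likelihood = all(
    uniform_likelihood(sample_1, t) == uniform_likelihood(sample_2, t)
    for t in thetas
)
```

Since the two samples give the same likelihood for every θ, their likelihood ratio trivially does not depend on θ, in line with the first characterization of sufficiency in Fact 2.1.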


2.1.6.2 From one to many parameters 


If we want to infer about more than one parameter we must enrich our sample. A
clean idea about this comes from the bootstrap method and (2.5) in particular.
If θ is a vector then s must be a vector too. Thus we are working with a system
of master equations, say in $(\theta_1,\dots,\theta_\nu)$, where we need a number of independent
equations equal to ν, hence ν statistics, in order to have a definite solution:


$$s_i = h_i(u_1^*,\dots,u_m^*;\theta_1,\dots,\theta_\nu), \qquad i = 1,\dots,\nu \tag{2.43}$$




Typical statistics may be moments of the random variable at hand. Yet, 
any well behaving statistics, possibly giving rise to independent equations, are 
welcome. As mentioned in Sec. 2.1.5, if these statistics induce suitable orderings 
on the parameter, (in particular if they are jointly sufficient), then they support 
a twisting algorithm as well. Dealing with the system (2.43) we may have some 
lucky conditions concerning: 


1. The separability of the equations w.r.t. the parameters. In the best case
we may have equations where each relates one statistic to one parameter;
otherwise, we must consider the distribution law of a parameter given
the other ones (for instance $F_{\Theta_1|\theta_2,\dots,\theta_\nu}$).


2. The independence of the seeds' functions. Solutions $\{\theta_1,\dots,\theta_\nu\}$ of (2.43)
each depend on the whole set of u's, by the Symmetry rule (see Sec. 1.1).
However they may depend through functions $\{g_i(u_1^*,\dots,u_m^*), i = 1,\dots,\nu\}$
that are partly or totally mutually independent. In the latter case, if the
equations are also separable we may partition the inference problem into ν
independent inference problems. Otherwise we must start considering the
joint distribution of the parameters. Then we may possibly be interested
in the marginal distributions of the parameters or in the distribution of a
function of them.


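Example 2.12. Consider a random variable X uniformly distributed between a
and b. The factorization criterion highlights $S_A = X_{(1)}$ and $S_B = X_{(m)}$ (same
notations as in Lemma 2.3) as joint sufficient statistics for the two parameters,
since

$$L(x_1,\dots,x_m;a,b) = \left(\frac{1}{b-a}\right)^m I_{[a,b]}(x_{(1)})\, I_{[a,b]}(x_{(m)}) \tag{2.44}$$


Thus the twisting arguments for learning A when B is known and vice versa
are the following²¹:


$$(s_A \le s_{\tilde a}) \Leftrightarrow (a \le \tilde a) \tag{2.45}$$
$$F_{A|b}(a) = \left(1 - \frac{s_A - a}{b - a}\right)^m I_{(-\infty,s_A]}(a) \tag{2.46}$$
$$(s_B \le s_{\tilde b}) \Leftrightarrow (b \le \tilde b) \tag{2.47}$$
$$F_{B|a}(b) = \left(1 - \left(\frac{s_B - a}{b - a}\right)^m\right) I_{[s_B,+\infty)}(b) \tag{2.48}$$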

(see Example 2.29 for a formal derivation of the B distribution of A and B). Of 
course $X_{(1)}$ and $X_{(m)}$ are not independent, as $X_{(1)} \le X_{(m)}$.
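
The distribution of A when b is known can be checked by bootstrapping seeds. The sketch below rests on assumptions that are not spelled out in the text: the sampling mechanism $x_i = a + (b-a)u_i$ with uniform seeds, so that the observed minimum satisfies $s_A = a + (b-a)u_{(1)}$, and the convention that the c.d.f. equals 1 to the right of $s_A$. Function names and numerical values are hypothetical.

```python
import random

def bootstrap_a_given_b(s_a, b, m, n_replicas=20000, seed=0):
    """Draw replicas of A compatible with the observed minimum s_a, b known."""
    rng = random.Random(seed)
    replicas = []
    for _ in range(n_replicas):
        u_min = min(rng.random() for _ in range(m))  # seed of the minimum
        # Solve the master equation s_a = a + (b - a) * u_min for a.
        replicas.append((s_a - b * u_min) / (1.0 - u_min))
    return replicas

def cdf_a_given_b(a, s_a, b, m):
    """Analytic c.d.f. of A given b; set to 1 for a >= s_a (assumed convention)."""
    if a >= s_a:
        return 1.0
    return (1.0 - (s_a - a) / (b - a)) ** m

def empirical_cdf(replicas, a):
    """Fraction of replicas not exceeding a."""
    return sum(r <= a for r in replicas) / len(replicas)

replicas = bootstrap_a_given_b(s_a=1.0, b=5.0, m=10)
gaps = [
    abs(empirical_cdf(replicas, a) - cdf_a_given_b(a, 1.0, 5.0, 10))
    for a in (0.0, 0.5, 0.8, 0.95)
]
```

With 20000 replicas the gaps between empirical and analytic values should be of the order of a few thousandths; a symmetric routine on the maximum would play the same role for B given a.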

Jointly estimating both parameters, (2.45) and (2.47) give rise to the
following implication chain, with $x_{(i)}$ the i-th element of the sample sorted in
ascending order

21 According to our notation, $s_A$ and $s_B$ are the observed statistics, hence in correspondence
with the unknown parameters a and b; $s_{\tilde a}$ and $s_{\tilde b}$ are the values we would have observed if,
for the same seed, $A = \tilde a$ and $B = \tilde b$.


hence many 
statistics, 


better if the 
equations are 


simple 


and the statistics 
are independent. 


None of the
above conditions
holds for a
uniform variable


Things are even 
worse with 


Gamma
distribution 




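$$x_{(m)} \ge \tilde x_{(m)} \ge \tilde x_{(m-1)} \ge \dots \ge \tilde x_{(2)} \ge \tilde x_{(1)} \ge x_{(1)}$$
$$\Leftrightarrow\; s_A \le s_{\tilde a} \le s_{\tilde b} \le s_B \;\Rightarrow\; \tilde a \ge a \,\wedge\, \tilde b \le b \tag{2.49}$$

where $\tilde x_{(i)}$ denotes the i-th sorted element of the sample we would have observed,
for the same seeds, with $A = \tilde a$ and $B = \tilde b$.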


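The companion probability inequality reads


$$1 - \left(\frac{s_B - s_A}{\tilde b - \tilde a}\right)^m \ge F_{A,B}(\tilde a,\tilde b) \ge \left(\frac{s_B - s_A}{s_B - \tilde a}\right)^m - \left(\frac{s_B - s_A}{\tilde b - \tilde a}\right)^m \tag{2.50}$$


The right inequality directly comes from simple algebra, as


$$P(A \le \tilde a, B > \tilde b) \ge \left(\frac{s_B - s_A}{\tilde b - \tilde a}\right)^m \tag{2.51}$$


Indeed, for any given $\tilde a$ (2.51) reads as an equality, since the twisting argument
on B has a double implication like in (2.47). Hence, since $(A \le \tilde a, B \le \tilde b)$ is
the complement of $(A \le \tilde a, B > \tilde b)$ with respect to the sole B, $F_{A,B}(\tilde a,\tilde b) =
P(A \le \tilde a, B > s_B) - P(A \le \tilde a, B > \tilde b)$. The leftmost term in (2.50) comes
by complementation of $(A \le \tilde a, B > \tilde b)$ with respect to the whole sample set.
Namely:


$$P(A \le \tilde a, B \le \tilde b) \le P\big((A \le \tilde a, B > \tilde b)^c\big) = 1 - P(A \le \tilde a, B > \tilde b) \le 1 - \left(\frac{s_B - s_A}{\tilde b - \tilde a}\right)^m \tag{2.52}$$

The farther $\tilde a$ is from $s_A$, the larger the gap between the two complementations.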


In some cases we are still able to find the joint sufficient statistics, but their
distribution law is much more intricate.


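Example 2.13. Consider a sample $\{x_1,\dots,x_m\}$ from a Gamma distribution law
with parameters n and λ (see page 345). Factorizing its likelihood:

$$L(x_1,\dots,x_m;n,\lambda) = \prod_{i=1}^{m}\frac{\lambda^n x_i^{n-1}}{(n-1)!}e^{-\lambda x_i} = \frac{\lambda^{nm}}{((n-1)!)^m}\left(\prod_{i=1}^{m}x_i\right)^{n-1} e^{-\lambda\sum_{i=1}^{m}x_i} \tag{2.53}$$


we get the joint sufficient statistics $\{S_N = \prod_{i=1}^{m} X_i, S_\Lambda = \sum_{i=1}^{m} X_i\}$ for N
and Λ. Hence we may state separately the following twisting arguments for one
parameter when the value of the other is known:


$$(\lambda \le \tilde\lambda) \Leftrightarrow (s_{\tilde\lambda} \le s_\Lambda) \tag{2.54}$$
$$F_{\Lambda|n}(\lambda) = \left(1 - \sum_{i=0}^{mn-1}\frac{(\lambda s_\Lambda)^i}{i!}e^{-\lambda s_\Lambda}\right) I_{[0,+\infty)}(\lambda) \tag{2.55}$$
$$(n \le \tilde n) \Leftrightarrow (s_N \le s_{\tilde n}) \tag{2.56}$$
$$F_{N|\lambda}(n) = 1 - F_{S_n}(s_N) \tag{2.57}$$


where $F_{S_n}$ is a Meijer G-function [Erdélyi et al., 1981]. Their joint distribution
law is not trivial.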




The exponential family extends to the multiparametric distributions as follows. 


Definition 2.7. A probability distribution belongs to the multiparametric
exponential family if its density function reads:


$$f_\theta(x) = \exp\left(\sum_{j=1}^{k} c_j(\theta)\,d_j(x) + a(x) + b(\theta)\right) \tag{2.58}$$


for a given $k \in \mathbb{N}$, where Θ is a set of parameters and $a$, $b$, $c_j$ and $d_j$ are
real-valued functions.


Lemma 2.2 extends as well to this family as follows: 


Lemma 2.4. For any family of distributions whose density functions belong to
the multiparametric exponential family, with reference to the notation in (2.58),
the set of statistics $\{\sum_{i=1}^{m} d_j(X_i), j = 1,\dots,k\}$ is a set of joint sufficient
statistics for Θ.


Example 2.14. Consider once more the family of Gaussian distributions having
arbitrary values for the parameters μ and σ. From (2.38) it is easy to see how
this family belongs to the multiparametric exponential family for $a(x) = 0$,
$b(\mu,\sigma) = -\ln\big(\sqrt{2\pi}\,\sigma\big) - \frac{\mu^2}{2\sigma^2}$, $c_1(\mu,\sigma) = -\frac{1}{2\sigma^2}$, $c_2(\mu,\sigma) = \frac{\mu}{\sigma^2}$, $d_1(x) = x^2$, and
$d_2(x) = x$. Therefore, from Lemma 2.4 a set of joint sufficient statistics for M
and Σ is given by $\{\sum_{i=1}^{m} X_i, \sum_{i=1}^{m} X_i^2\}$.
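
To make the role of the joint sufficient statistics concrete, the sketch below computes the Gaussian log-likelihood twice: directly from the sample, and from the pair $(\sum x_i, \sum x_i^2)$ through the decomposition of Example 2.14. The function names and the sample are hypothetical, chosen only for illustration.

```python
import math

def gaussian_loglik_direct(sample, mu, sigma):
    """Sum of the log densities of the single observations."""
    return sum(
        -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma**2)
        for x in sample
    )

def gaussian_loglik_from_stats(m, sum_x, sum_x2, mu, sigma):
    """Same quantity, using only the sample size and the two statistics."""
    b = -math.log(math.sqrt(2 * math.pi) * sigma) - mu**2 / (2 * sigma**2)
    c1 = -1.0 / (2 * sigma**2)   # multiplies d1(x) = x^2
    c2 = mu / sigma**2           # multiplies d2(x) = x
    return m * b + c1 * sum_x2 + c2 * sum_x

sample = [0.3, -1.2, 2.5, 0.9, 1.1]
sum_x = sum(sample)
sum_x2 = sum(x * x for x in sample)
max_gap = max(
    abs(gaussian_loglik_direct(sample, mu, sigma)
        - gaussian_loglik_from_stats(len(sample), sum_x, sum_x2, mu, sigma))
    for mu, sigma in [(0.0, 1.0), (1.5, 0.7), (-2.0, 3.0)]
)
```

The two computations agree up to rounding for every $(\mu,\sigma)$ tried, which is the operational meaning of sufficiency here: once the two sums are stored, the sample itself is no longer needed to evaluate the likelihood.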


We will deal with more complex relations between parameters in the next
chapter when we discuss computational learning. In the case of independent
parameters a special debt is paid by the sample to the fact that the statistic is
computed in view of inferring a given parameter, but with no knowledge about
the others (which we will call hidden), as occurs for the mean and variance of
a Gaussian distribution law. The sufficiency condition does not suffer from an
enlargement of the parameter set: all points in a contour of a
sufficient statistic have the same virtual probability at each assignment of both
target and hidden parameters. So if we checked equiprobability with respect to the
target parameter, this property is maintained if we specify the hidden ones too.
Minimality may suffer some chimeric effect: we may have no pair of contours
with the same probability profile for a fixed assignment to the hidden
parameters; but two contours may have the same probability profile for different values
of the hidden parameters, and this may break the uniqueness of the parameter
distribution law.


Example 2.15. Consider a Gaussian random variable, for which we want to compute
the distribution law of both the mean μ and the variance σ². Let us define the

Still exponential 
family but 


dependent on 
many parameters 


Joint standard 
sufficient 
statistics 


Hidden 
parameters 


and chimeric 
effects. 


Degrees of 
freedom 
shortening 




Fig. 2.19: (a) A contour of the statistic $R_\Sigma = \sum_{i=1}^{3}(X_i - \bar X)^2$ corresponding to $r_\Sigma = 2$;
(b) the points of this contour satisfying $s_M = 0$.


Normal variable Z as a Gaussian variable with mean 0 and variance 1.
Starting from seeds constituted by specifications z of this variable we have the
following variant of sampling mechanism (see Remark 2.1) for any X²²:


$$x_i = \mu + \sigma z_i \tag{2.59}$$


For $s_M = \sum_{i=1}^{m} x_i$ and $r_\Sigma = \sum_{i=1}^{m}(x_i - \bar x)^2$, the twisting arguments on the two
individually considered parameters read:


$$(\mu \le \tilde\mu) \Leftrightarrow (s_M \le s_{\tilde\mu}) \tag{2.60}$$
$$(\sigma \le \tilde\sigma) \Leftrightarrow (r_\Sigma \le r_{\tilde\sigma}) \tag{2.61}$$


Now the contour regions of $R_\Sigma$ do not satisfy the conditions of a sufficient
statistic. Indeed the contours of $R_\Sigma$ suffer the previously mentioned chimeric
phenomenon. For instance, surfaces like those in Fig. 2.19(a), representing a
contour of $R_\Sigma$ for samples of size three, represent a collection of minimal sufficient
partitions. Each partition is identified by the intersection of the above surface
with the plane $\sum_{i=1}^{m} x_i = c$ for a proper c. However, if we rotate the reference
framework of the sampled points rigidly till we get one axis (say, the Z axis)
coinciding with the bisector of the first octant, $X_m$ maps into a vector $Y_m$
whose coordinate along the bisector axis freezes. Once this coordinate has been
fixed, $R_\Sigma$ is a minimal sufficient statistic w.r.t. the parameter Θ of a Gaussian
variable whose sampled points' coordinates read $\{(x_1 - x_2)/\sqrt{2},\, -(x_1 + x_2 -
2x_3)/\sqrt{6},\, -\sqrt{3}\,\bar x\}$. This is usually reported as the fact that the number of degrees
of freedom of the sample is m − 1 in place of m because of the further relation
$\sum_{i=1}^{m} x_i = m\bar x$ used within the sample to fix the value of $R_\Sigma$.


22 We prefer here to refer to this kind of seed rather than to a uniform variable because the
resulting mechanism is easier to deal with while maintaining the power of explaining any
Gaussian variable.
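
The rotation mentioned in Example 2.15 can be verified on a concrete sample of size three. The sketch below uses the rotated coordinates quoted in the example; the function names and the numerical sample are hypothetical. It checks that the coordinate along the bisector equals $-\sqrt{3}\,\bar x$ and that the sum of squared deviations depends only on the other two coordinates.

```python
import math

def rotate(x1, x2, x3):
    """Rotated coordinates of a three-sized sample, third axis on the bisector."""
    y1 = (x1 - x2) / math.sqrt(2)
    y2 = -(x1 + x2 - 2 * x3) / math.sqrt(6)
    y3 = -math.sqrt(3) * (x1 + x2 + x3) / 3
    return y1, y2, y3

def sum_of_squared_deviations(sample):
    """Sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)

sample = (1.0, 2.5, -0.7)
y1, y2, y3 = rotate(*sample)
mean = sum(sample) / 3

# Once the bisector coordinate is frozen, only y1 and y2 remain free.
gap_r = abs(y1**2 + y2**2 - sum_of_squared_deviations(sample))
gap_mean = abs(y3 + math.sqrt(3) * mean)
gap_norm = abs(y1**2 + y2**2 + y3**2 - sum(x * x for x in sample))
```

All three gaps vanish up to rounding: the rotation preserves the norm, one coordinate is spent on the sample mean, and the statistic is a function of the remaining m − 1 = 2 coordinates.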


In our perspective we read the fact through the key feature constituted by the 
pledge points. 


Definition 2.8. Given a random variable X with range $\mathfrak{X}$ and a random sample
$X_m$, a statistic $S = h(X_m)$ has a set $\xi_\nu$ of ν pledge points ($\xi_\nu \subseteq \mathfrak{X}$, ν possibly
depending on $X_m$) if S is not a minimal sufficient statistic, but a mapping
$X_m \to Y_m$ exists from $\mathfrak{X}^m$ to $\mathfrak{X}^m$ preserving the independence of the single
items such that:


• for fixed values of some ν items of $Y_m$, S is a minimal sufficient statistic;


• S is a function of the sole remaining m − ν items of $Y_m$.


Definition 2.9. We call debt p(m) of $S = h(X_m)$ the supremum of the cardinalities
of the pledge point sets, and number of degrees of freedom of S the quantity
m − p(m).


This is how some degrees of freedom are burned in a sample. Besides the
case mentioned, this occurs when we look for minimal sufficient statistics in
the framework of linear models, such as linear regression or ANOVA models
[Morrison, 1967]. Other degrees of freedom are worn away by a different kind
of point, called sentry points, as we will explain in the following chapter. Note
that things may be different if we refer to the joint sufficient statistics. For
instance, in the case of the Gaussian distribution in Example 2.15 the joint
sufficient statistics are those mentioned in Example 2.14. Using the latter has
the benefit of burning no pledge points, but the drawback of not working with
independently distributed statistics, as $\bar X$ and $R_\Sigma$ are.


2.2 Confidence intervals 


Once you know a parameter distribution law you have all the information
about it that is available in our probabilistic framework. Hence if it represents the sole free
parameter of the phenomenon you are considering, you may ask any well
formulated question about the phenomenon. A first type of question goes with the
format of confidence intervals. For instance, we cannot claim to deduce exactly
the percentage of unsolicited e-mails you will receive in the future just from
observing what came into the mailbox during the first month of life. Rather, as
this percentage is a random variable, you may look for ranges of values within
which the variable has a good probability of lying. For instance you may wonder
what is the percentage you expect at most or at least (or both) with probability
0.900. This is the philosophy behind confidence intervals. Namely:


Sufficiency ... if 
you pay pledge 


A rate of junk 
mails of 30-42% 


with a confidence 
equal to 0.900 


that you will 
appreciate in long 


series of
intervals. 


The interval 
extremes are 
quantiles of the 
parameter 
distribution law 




Definition 2.10. Given a random variable with parameter Θ, a sample $x =
\{x_1,\dots,x_m\}$ and a real number $0 < \delta < 1$, $(\theta_i,\theta_s)$ is called a $1-\delta$ confidence
interval for Θ if

$$P(\theta_i < \Theta \le \theta_s) = 1 - \delta \tag{2.62}$$

The quantity δ is called confidence level²³.


It is also understood that $\theta_i$ and $\theta_s$ are functions of x. As usual, this
definition's value is appreciated in terms of asymptotic behavior of our inference
action. In a long history of samples x and confidence intervals computed on
them with a given confidence level δ, the value $1-\delta$ closely approximates the frequency
with which, in the population prosecuting a sample, we measure a parameter
falling in the related interval. Indeed, with whatever distribution law the
samples appear in the history and no matter which Θs are questioned in their
prosecutions, considering the specifications of the Θs with no mention of X²⁴
returns a random variable Ξ, obviously depending on the single distribution laws
in the history, but with P(Ξ ∈ confidence interval) still equal to $1-\delta$.


2.2.1 Solving inverse problems to find the confidence interval 
extremes 


By definition

$$P(\theta_i < \Theta \le \theta_s) = F_\Theta(\theta_s) - F_\Theta(\theta_i) \tag{2.63}$$


Recalling the definition of the α quantile of a random variable X (the
specification $x_\alpha$ of X such that $F_X(x_\alpha) = \alpha$), we easily realize that the extremes
of a confidence interval of Θ are suitable quantiles of this parameter. In turn,
the distribution law of Θ is a function of the specifications we observed of the
random variable X which Θ refers to. Hence the quantiles are a function of
the X sample, too. In particular, if we are interested in unilateral intervals, for
instance pushing $\theta_i$ to $-\infty$, then the mathematical problem reads as the following
inverse problem: find $\theta_s$ such that


$$F_\Theta(\theta_s) = 1 - \delta \tag{2.64}$$


whose solution $\theta_s$ is the $1-\delta$ quantile of Θ, modulo some approximation in case
of Θ discrete (and then $F_\Theta$ stair-shaped)²⁵.


23 As Θ is a random variable in our framework, we decided to slightly modify the canonical
definition of the confidence interval by substituting < with ≤ in the inequality concerning the
right extreme. In this way we compute the above probability right as a difference of c.d.f. $F_\Theta$
instantiations.

24 It looks like the egg-chicken dilemma: $X_m$ will be distributed according to the distribution
law determined by θ, i.e. the distribution law of its suffix. In turn, Θ will be distributed
compatibly with the observed $x_m$. To break this loop the Kolmogorov approach assumes θ given,
though unknown, and offers arguments about $X_m$. We prefer starting with a given sample
$x_m$ and offering argumentation about Θ, since we find it both more logically sound and
more efficient for solving complex inference problems.

25 In this case indeed we could be obliged to decide between an $F_\Theta(\theta_s)$ a little less than $1-\delta$
and an $F_\Theta(\theta_s)$ a little greater than $1-\delta$.




Example 2.16. Consider a sample x drawn from a negative exponential
distribution of parameter λ. According to (2.21), $(-\infty,\lambda_s]$ is a confidence interval
of level δ for Λ if


$$\sum_{i=0}^{m-1}\frac{(\lambda_s s)^i}{i!}e^{-\lambda_s s} = \delta \tag{2.65}$$

where $s = \sum_{i=1}^{m} x_i$. Therefore the $(1-\delta)$ quantile $\lambda_{1-\delta} = \lambda_s$ can be found
through the solution of a relatively simple equation.

Indeed, as $X_i$ follows a negative exponential distribution of parameter λ,
$\Gamma_{m,\lambda} = \sum_{i=1}^{m} X_i$ follows a Gamma distribution with parameters m and λ (see
Fact B.6). Moreover $\lambda X_i$ follows an exponential distribution of parameter 1 (as
follows from a simple scale change) and $\Gamma_{m,1} = \lambda\sum_{i=1}^{m} X_i$ follows a Gamma
distribution with parameters m and 1. In turn, the latter is denoted as a
Chi square distribution $\chi^2_{2m}$ [Wilks, 1962]²⁶ (see page 347) having parameter
r = 2m evaluated in the double of the $\Gamma_{m,1}$ specifications. Namely,


$$F_\Lambda(\lambda; s, m) = F_{\Gamma_{m,\lambda}}(s) = F_{\Gamma_{m,1}}(\lambda s) = F_{\chi^2_{2m}}(2\lambda s) \tag{2.66}$$


Hence an easier way than through (2.65) to solve (2.64) is to solve the equation


$$F_{\chi^2_{2m}}(x) = 1 - \delta \tag{2.67}$$

getting the quantile $\chi^2_{1-\delta}$, and then $\lambda_s = \frac{\chi^2_{1-\delta}}{2s}$ from the inversion of the equality

$$2\lambda_s s = \chi^2_{1-\delta} \tag{2.68}$$


This entails a linear relation between $\lambda_s$ and $\frac{m}{s}$ that we easily draw in Fig. 2.20(a)
to show the dependence of the width of the confidence intervals on this statistic.


Analogously, we may compute a confidence interval unlimited on the right as 
follows. 


Example 2.17. Example 2.16 extends straightforwardly to the computation of
a unilateral confidence interval of level δ for Λ whose form is $[\lambda_i,+\infty)$²⁷. Here
we invert the relation

$$2\lambda_i s = \chi^2_{\delta} \tag{2.69}$$


to obtain the course of the confidence interval shape w.r.t. the statistic shown
in Fig. 2.20(b), contrasted in Fig. 2.20(c) with the previous graph.
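
The two unilateral extremes for the negative exponential parameter can be computed without any statistical package by inverting the c.d.f. of Λ numerically. The sketch below is one possible routine, under the assumption that $F_\Lambda(\lambda) = 1 - \sum_{i=0}^{m-1}(\lambda s)^i e^{-\lambda s}/i!$ (the Gamma c.d.f. with integer shape m evaluated in s); the function names, the bisection strategy and the sample values m = 10, s = 5 are illustrative choices, not taken from the text.

```python
import math

def cdf_lambda(lam, s, m):
    """F_Lambda(lam) = 1 - sum_{i<m} (lam*s)^i exp(-lam*s) / i!."""
    if lam <= 0:
        return 0.0
    t = lam * s
    term = math.exp(-t)      # i = 0 term of the sum
    total = term
    for i in range(1, m):
        term *= t / i        # builds (lam*s)^i exp(-lam*s) / i! iteratively
        total += term
    return 1.0 - total

def quantile_lambda(alpha, s, m, tol=1e-12):
    """Solve F_Lambda(lam) = alpha by bisection (F_Lambda is increasing)."""
    lo, hi = 0.0, 1.0
    while cdf_lambda(hi, s, m) < alpha:
        hi *= 2.0            # enlarge the bracket until it contains the root
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if cdf_lambda(mid, s, m) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

m, s, delta = 10, 5.0, 0.1
lam_s = quantile_lambda(1 - delta, s, m)   # (-inf, lam_s] has level delta
lam_i = quantile_lambda(delta, s, m)       # [lam_i, +inf) has level delta
```

Because the relation between the quantile and 1/s is linear, halving s doubles both extremes; this is the linearity exploited when the intervals are plotted against m/s.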
If we are looking for a bilateral confidence interval, then the inverse problem


$$F_\Theta(\theta_s) - F_\Theta(\theta_i) = 1 - \delta \tag{2.70}$$


Confidence 
regions 


via twisting 
quantiles. 


Bilateral 
confidence 
intervals 




Fig. 2.20: Drawing unilateral 0.900 confidence intervals for the parameter Λ of a
negative exponential distribution, on the basis of a sample of size m = 10. X axis:
m/s, where s is the sum of the sample elements. Y axis: 0.900 confidence interval
corresponding to the value for m/s in the abscissa. (a) intervals unbounded on the
left, (b) intervals unbounded on the right, (c) comparing the two kinds of intervals.



Fig. 2.21: Drawing a confidence interval from a distribution symmetric around its 
mean. Thick curve: density of a Gaussian distribution, describing the mean M of a 
Gaussian variable as in (2.81); gray region measures the probability = 0.900 that M 
falls in the subtending confidence interval; (a) and (b) refer to a symmetric interval and 
an asymmetric one, respectively. Each region is contrasted with the other, delimited 
by dashed lines. 




proves undetermined.

A further constraint removing this indeterminacy is usually constituted by
the symmetry of the tails. Namely, we equipartition the probability that Θ lies
outside the interval either on the left or on the right. The semantic value of this
strategy (giving the same disappointment to the two events) is accompanied by a
syntactic one when the distribution law of Θ is symmetric around its mean. In
this case we also get the confidence interval symmetric around the mean, which
ensures the minimum width $\theta_s - \theta_i$ for any $1-\delta$ (see Fig. 2.21). This entails fixing
$\theta_i = \theta_{\delta/2}$ and $\theta_s = \theta_{1-\delta/2}$ such that $F_\Theta(\theta_{1-\delta/2}) - F_\Theta(\theta_{\delta/2}) = 1 - \delta/2 - \delta/2 = 1 - \delta$.


Example 2.18. In the case of a negative exponential random variable X, we get
the extremes of a confidence interval for its parameter Λ by solving the following
equations:


$$F_\Lambda(\lambda_i) = \delta/2 \tag{2.71}$$

$$F_\Lambda(\lambda_s) = 1 - \delta/2 \tag{2.72}$$

Thanks to their linearity with 1/s the plots of $\lambda_i$ and $\lambda_s$ with m/s give rise to
the gray region in Fig. 2.22(a) delimiting an angle that represents a confidence
region for the parameter.

Equations (2.68) and (2.69) represent the final operational consequence of
the twisting argument (2.18). Here we directly twist quantiles of our choice. We
may decide to twist $\chi^2$ quantiles with those of Λ, as we did in Fig. 2.22(a). But
from the same equations we draw the quantiles of $\frac{1}{\Lambda}$ as well, where the choice of
the statistic, either $\frac{s}{m}$ or $\frac{m}{s}$, is just a matter of our convenience. For instance
Fig. 2.22(b) reports the same confidence intervals and numerical trajectories
for Λ but in a less comfortable way than Fig. 2.22(a), because we have decided
to show the dependence on the former statistic in place of the latter. Finally,
in Fig. 2.22(c) we obtain again the simple picture reporting $\frac{1}{\Lambda}$ as a function of $\frac{s}{m}$.


Example 2.19. Wanting to repeat the computations for the parameter P of a
Bernoulli variable, we note that only bounds are available for $F_P(p)$. Let us
introduce the function $I_\beta$ to simplify the notation, defined as:


$$I_\beta(h,r) = 1 - \sum_{i=0}^{h-1}\binom{h+r-1}{i}\beta^i(1-\beta)^{h+r-1-i} \tag{2.73}$$


This function is denoted incomplete Beta and represents the cumulative
distribution function of the random variable $Z_{Be(h,r)}$ following a Beta distribution of

26 Because of Greek alphabet peculiarity, we will denote with the same symbol the variable
and its quantiles.

27 We obtain the same extremes focusing on intervals having form $(-\infty,\lambda_s)$ or $(\lambda_i,+\infty)$, as
the investigated parameter follows a continuous distribution.


Symmetric is 
generally better 


The description 
format is up to us 


The incomplete 
Beta 


An undervalued 
confidence 


The operational 
scheme 




Fig. 2.22: Generating 0.900 confidence intervals for the parameter Λ of a negative
exponential random variable with a sample of m = 20 elements. $\bar x = \frac{s}{m}$: sample
mean. Gray region: confidence region for the investigated parameter. (a) confidence
region for Λ w.r.t. the inverse of the sample mean; (b) confidence region for Λ w.r.t. the
sample mean; (c) confidence region for $\frac{1}{\Lambda}$ w.r.t. the sample mean. Black lines report
20 trajectories of the parameter (in the two alternative expressions) in the ordinates
with the actual statistic in the abscissas, when λ ranges from 0 to 100, for different
seeds.


parameters h and r (see Appendix B for details), that is $P(Z_{Be(h,r)} \le \beta)$. With
this notation (2.20) reads


$$I_p(k, m-k+1) \ge F_P(p) \ge I_p(k+1, m-k) \tag{2.74}$$

and the best we can do to satisfy (2.62) is to take

$$I_{l_s}(k+1, m-k) - I_{l_i}(k, m-k+1) \tag{2.75}$$


as lower bound to $F_P(l_s) - F_P(l_i) = P(l_i < P \le l_s)$. Applying the symmetry
constraint directly to these bounds we get the extremes of a confidence interval
$(p_i,p_s)$ with confidence actually $\ge 1-\delta$ as the solutions $l_i$ and $l_s$ of the equations
system

$$I_{l_s}(k+1, m-k) = 1 - \delta/2 \tag{2.76}$$
$$I_{l_i}(k, m-k+1) = \delta/2 \tag{2.77}$$


The plot of these extremes with k gives rise to the cigar-like confidence region
in Fig. 2.23.
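
The system (2.76)-(2.77) can be solved numerically with a few lines of code. The sketch below is only an illustration: it evaluates the incomplete Beta function for integer arguments through the binomial sum of (2.73) and inverts it by bisection. The function names are hypothetical, and the conventions $l_i = 0$ for k = 0 and $l_s = 1$ for k = m are assumptions added here to handle the boundary cases, which the text does not discuss.

```python
import math

def incomplete_beta(beta, h, r):
    """I_beta(h, r) for integers h, r >= 1, via the binomial sum in (2.73)."""
    n = h + r - 1
    return 1.0 - sum(
        math.comb(n, i) * beta**i * (1.0 - beta) ** (n - i) for i in range(h)
    )

def solve_increasing(f, target, tol=1e-12):
    """Bisection on [0, 1] for an increasing function f with f(x) = target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bernoulli_interval(k, m, delta):
    """Extremes (l_i, l_s) solving (2.77) and (2.76); edge cases by convention."""
    l_i = 0.0 if k == 0 else solve_increasing(
        lambda p: incomplete_beta(p, k, m - k + 1), delta / 2)
    l_s = 1.0 if k == m else solve_increasing(
        lambda p: incomplete_beta(p, k + 1, m - k), 1 - delta / 2)
    return l_i, l_s

l_i, l_s = bernoulli_interval(k=5, m=20, delta=0.1)
```

Running `bernoulli_interval` for k = 0, ..., m and plotting the two extremes against k is one way to reproduce the shape of a confidence region of this kind.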


The solution of the above examples consists again in a twisting operation.
We twist quantiles of the parameter distribution with those of the statistic
distribution. The general scheme is


$$\theta_\alpha = \big(\theta \mid s_{g(\alpha),\theta} = s_o\big) \tag{2.78}$$




Fig. 2.23: Generating 0.900 confidence intervals for the mean P of a Bernoulli random 
variable with same notations and different seeds as in Fig. 2.5. Curves: trajectories 
described by the confidence interval extremes when the observed number k of 1 in the 
sample ranges from 0 to m. 


where:
• $s_o$ is the observed statistic, as usual,


• $s_\theta$ is the value assumed by the above statistic when Θ = θ, after the
master equation (2.5),


• $g(\alpha)$ is a suitable function of α depending on the monotonicity relation
between $s_\theta$ and θ exploited by the twisting argument,


• $s_{\gamma,\theta}$ is the γ quantile of the above statistic when Θ = θ.


For instance $g(\alpha) = \alpha$ in the case of a negative exponential variable and $g(\alpha) =
1-\alpha$ in the case of a Bernoulli variable. The actual implementation of (2.78) may
involve some computational difficulty. However with a common PC, possibly
with the help of specific packages²⁸, we may always write down a routine that
takes α and $s_o$ as input and outputs $\theta_\alpha$ with a satisfactory approximation.

Given the Gaussian distribution law's widespread use in many operational
frameworks, its parameters play a special role in statistics. This couples with
a very easy derivation of the distribution of these parameters through the above
scheme. Let us start with the case where only one parameter, either the mean
or the standard deviation of the random variable, is unknown.


Example 2.20. Let X be a Gaussian variable with parameters μ and σ and
$\{x_1,\dots,x_m\}$ a specification of a sample drawn from it. We may quickly get the
distribution laws of each one of these parameters, when the other is known, as
follows.

From Example 2.10 we know that in these conditions $s_M = \sum_{i=1}^{m} x_i$ and
$s_\Sigma = \sum_{i=1}^{m}(x_i - \mu)^2$ are minimal sufficient statistics w.r.t. μ and σ respectively;
hence the distribution law of these parameters will also be a separate function of

28 Such as Mathematica [Wolfram, 2003], Maple [Aratyn and Rasinariu, 2005], Scilab
[Gomez, 1999], and so on.


Shortcuts are 
possible 


since we may 
pivot around a 


Normal
distribution, 


since we may 
pivot around a 
Chi square 
distribution, 


and in general 
with a pivotal 
quantity. 




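these statistics. According to the sampling mechanism (2.59) in Example 2.15,
here we derive the two logical implications:


$$(\mu \le \tilde\mu) \Leftrightarrow (s_M \le s_{\tilde\mu}) \tag{2.79}$$
$$(\sigma \le \tilde\sigma) \Leftrightarrow (s_\Sigma \le s_{\tilde\sigma}) \tag{2.80}$$


Hence²⁹


$$F_{M|\sigma}(\mu) = 1 - F_{S_\mu}(s_M) = 1 - F_Z\left(\frac{s_M - m\mu}{\sigma\sqrt{m}}\right) = F_Z\left(\frac{m\mu - s_M}{\sigma\sqrt{m}}\right) \tag{2.81}$$


where the third element of the equation chain refers to the Normal variable
Z, and the last item comes from the symmetry of this variable around 0. The
simple operational rule for implementing (2.78) w.r.t. M is: consider the pivotal
quantity $\pi = \frac{m\mu - s_M}{\sigma\sqrt{m}}$, take the quantile $z_\alpha$ of the Normal variable Z and
solve the equation $\pi = z_\alpha$ in μ, which leads to: $\mu_\alpha = s_M/m + z_\alpha\sigma/\sqrt{m}$.

For instance, if we want a symmetric two-sided 0.900 confidence interval for
M we fix $\mu_{dw} = s_M/m + z_{0.0500}\,\sigma/\sqrt{m}$ and $\mu_{up} = s_M/m + z_{0.9500}\,\sigma/\sqrt{m}$, so that
$P(\mu_{dw} < M \le \mu_{up}) = 0.900$.

Analogously

$$F_{\Sigma|\mu}(\sigma) = 1 - F_{S_\sigma}(s_\Sigma) = 1 - F_{\chi^2_m}\left(\frac{s_\Sigma}{\sigma^2}\right) \tag{2.82}$$

where $\chi^2_m$ is a Chi square random variable with parameter m, whose density
function is:


$$f_{\chi^2_m}(x) = \frac{1}{\Gamma(m/2)}\left(\frac{1}{2}\right)^{m/2} x^{m/2-1}e^{-x/2}\, I_{[0,+\infty)}(x) \tag{2.83}$$

where $\Gamma(k) = (k-1)!$ for $k \in \mathbb{N}$³⁰. The analogous operational rule is: consider
the pivotal quantity $\pi = \frac{s_\Sigma}{\sigma^2}$, take the quantile $\chi^2_{1-\alpha}$ of the variable $\chi^2_m$ and
solve the equation $\pi = \chi^2_{1-\alpha}$ in σ, which brings: $\sigma_\alpha = \sqrt{s_\Sigma/\chi^2_{1-\alpha}}$.


For instance, if with m = 20 we want a one-sided 0.900 confidence interval
for $\Sigma^2$ having 0 as the left extreme, we fix $\sigma^2_{up} = \sigma^2_{0.900} = s_\Sigma/\chi^2_{0.100} = s_\Sigma/12.44$,
so that $P(0 < \Sigma^2 \le s_\Sigma/12.44) = 0.900$.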
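
The operational rules of Example 2.20 translate almost literally into code. The sketch below is illustrative only: the function names and the numerical samples are hypothetical, the Normal quantiles come from Python's standard `statistics.NormalDist`, and the Chi square quantile is passed in by the caller (for instance the value 12.44 quoted in the text for m = 20), since the standard library offers no Chi square quantile function.

```python
import math
from statistics import NormalDist

def mean_interval(sample, sigma, delta):
    """Symmetric two-sided 1 - delta interval for M when sigma is known."""
    m = len(sample)
    s_m = sum(sample)
    z = NormalDist()  # standard Normal variable Z
    mu_dw = s_m / m + z.inv_cdf(delta / 2) * sigma / math.sqrt(m)
    mu_up = s_m / m + z.inv_cdf(1 - delta / 2) * sigma / math.sqrt(m)
    return mu_dw, mu_up

def variance_upper_extreme(sample, mu, chi2_quantile):
    """Upper extreme s_Sigma / chi2_quantile for the variance, mu known."""
    s_sigma = sum((x - mu) ** 2 for x in sample)
    return s_sigma / chi2_quantile

mu_dw, mu_up = mean_interval([1.0, 2.0, 3.0, 4.0], sigma=2.0, delta=0.1)
var_up = variance_upper_extreme([1.0, -1.0], mu=0.0, chi2_quantile=2.0)
```

In the first call $\sigma/\sqrt{m} = 1$, so the half-width of the interval coincides with the Normal quantile $z_{0.95}$; the second call uses a made-up quantile purely to show the arithmetic.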


Remark 2.6. This reasoning extends to any parameter Θ for which a pivotal
quantity Z can be found, i.e. a total and invertible function of θ that is monotone
in a statistic S and in no other way depends on the sample, and such that the
distribution law of Z does not depend on θ. It is generally referred to as the
pivotal quantity method and is used in the Kolmogorov framework for finding
confidence intervals³¹.


29 See footnote 31 in Chapter 1 for notations based on character “m.”

30 Its full expression is $\Gamma(t) = \int_0^{\infty} x^{t-1}e^{-x}\,dx$.

31 Actually it represents (nearly) the sole way, when applicable, of finding confidence intervals 
in the Kolmogorov approach. 




Scheme (2.78) holds in the case where we may exploit a twisting argument. 
We do not care, however, how the parameter distribution law has been derived. 
Hence if we obtained it by bootstrapping populations we will numerically invert 
(2.70) directly after suitable partition of the distribution tails. 


Example 2.21. Using the same notations as in Fig. 2.22, Fig. 2.24(a) shows how
the confidence intervals for the parameter Λ of a symmetric negative exponential
distribution depend on the values of the sample's median. In this case, quantiles
are numerically approximated on the basis of an empirical c.d.f. computed as in
Example 2.3. Analogously, Fig. 2.24(b) illustrates the shape of 0.900 confidence
intervals for the parameter K of a hypergeometric distribution versus the values
of the statistic h described in Example 2.4. Also in this case, quantiles arise
from a numerical procedure on the related empirical c.d.f.


2.2.2 Checking the coverage 


The same method that led us to draw confidence intervals in the last two examples may work for checking the amplitude of the analogous interval computed analytically. Virtually, we bootstrap parameter populations in correspondence with given values of the pivot statistic and check that the rates by which populations exceed the interval extremes essentially coincide with the confidence level with which the extremes have been computed. More simply, we draw seeds and report on the confidence regions the trajectories $(s_\theta, \theta)$ deriving from the master equation (2.5).
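A minimal sketch of such a coverage check follows, for the Gaussian mean with known standard deviation. It assumes the sampling mechanism $x_i = \mu + \sigma z_i$ with standard Normal seeds $z_i$ as a stand-in for the master equation, which is not reproduced here; the function name and the numeric inputs are hypothetical.

```python
import random
from statistics import NormalDist


def seed_coverage(s_m, m, sigma, level=0.90, n_replicas=20000, seed=0):
    """Fraction of bootstrapped mu values falling inside the analytic interval."""
    rng = random.Random(seed)
    z = NormalDist()
    tail = (1.0 - level) / 2.0
    center, scale = s_m / m, sigma / m ** 0.5
    mu_dw = center + z.inv_cdf(tail) * scale
    mu_up = center + z.inv_cdf(1.0 - tail) * scale
    inside = 0
    for _ in range(n_replicas):
        seeds = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Assumed mechanism x_i = mu + sigma * z_i, hence
        # s_m = m * mu + sigma * sum(z_i); solving for mu gives the
        # parameter value compatible with the observed s_m and these seeds.
        mu = (s_m - sigma * sum(seeds)) / m
        inside += mu_dw < mu < mu_up
    return inside / n_replicas


coverage = seed_coverage(s_m=40.0, m=20, sigma=1.5)
```

With the observed statistic held fixed, the share of compatible parameter values falling between the analytic extremes should stay close to the nominal 0.900.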


Example 2.22. According to (2.6), the linear segments in Fig. 2.22(a) represent the set $\left\{\left(\frac{\lambda}{\sum_{i=1}^m (-\ln u_i)}, \lambda\right)\right\}$ for randomly extracted seeds $\{u_1, \ldots, u_m\}$ with $m = 20$ and $\lambda$ spanning from 0.5 to 100. Of course, we obtain hyperbolic segments if we prefer to represent the pairs (statistic, parameters) in terms of $\left(\sum_{i=1}^m x_i, \lambda\right)$, and so on. In any case we observe that, as expected, nearly $100(1-\delta)$ percent of the lines fall within the analytically computed bounds. With similar computations we may check the coverage of the confidence region in Fig. 2.24.


Example 2.23. Repeating the above experiment for the Bernoullian variable, we now get the picture in Fig. 2.23 where the two curves come from the solution of the equalities (2.76-2.77) with $\delta = 0.1$ and $k$ as running parameter. For didactical reasons, we maintained the same fret lines as in Fig. 2.5 in the assumption that the step ordinates, as frequencies of ones in relatively long suffixes, almost coincide with the probabilities of having one in the suffixes, which are the parameter specifications we aim to gauge in the confidence intervals.

We cannot perform a similar experiment for short continuations of an actual sample, since the number of 1’s therein does not define a random variable, as mentioned before. However, to appreciate the operational counterpart of a probability measure in this case too, we generated a variety of pairs (sample, population) from a Bernoulli variable with $p$ ranging from 0 to 1 under the constraint of both $k+K$ and $n+N$ as in (1.10) being constant. Then we grouped the strings per number of 1’s in the suffix and computed their frequencies. In Fig. 2.25 we see that, as we may expect, probabilities appreciated through (1.10) go around but do not coincide with these frequencies.

[Margin notes: The twisting argument is needed no matter how the parameter distribution is obtained. Are these intervals sufficiently tight? The locus of the most probable statistic-parameter pairs. An approach true also for small populations and suitable also for joint parameters.]

Fig. 2.24: 0.900 confidence intervals (a) for the parameter $\Lambda$ of a symmetric negative exponential distribution versus possible values for the median of a sample of size $m = 100$; (b) for the parameter $K$ of a hypergeometric distribution with parameter $N = 50$, on the basis of a sample of size $m = 10$. The parameter distributions are obtained through bootstrapping procedures using $n = 1000$ and $n = 500$ replicas, respectively. Same notations as Fig. 2.22.


2.2.3 Devising confidence regions 


Inverting the joint distribution law of more than one parameter may prove very easy. This happens for the extremes of a uniform distribution.


Example 2.24. Continuing Example 2.12, we are in search of a region in the plane $(a,b)$ where the extremes $A$, $B$ distributed according to (2.50) take specifications with probability $1-\delta$. Given the constraints $A < s_A$, $B > s_B$ we decide to focus on a rectangle defined by $a_i < a < s_A$ and $s_B < b < b_s$, which specifies the condition:

$$F_{A,B}(s_A, b_s) - F_{A,B}(a_i, b_s) = 1 - \delta \qquad (2.84)$$

as $F_{A,B}(s_A, s_B) = 0$. The usual strategy of bipartitioning $\delta$ and substituting $F_{A,B}(s_A, b_s)$ with its lower bound and $F_{A,B}(a_i, b_s)$ with its upper bound emerging from (2.50) leads to unfeasible solutions of the following equations:

$$1 - \left(\frac{s_B - s_A}{b_s - s_A}\right)^m = 1 - \frac{\delta}{2} \qquad (2.85)$$

$$1 - \left(\frac{s_B - s_A}{b_s - a_i}\right)^m = \frac{\delta}{2} \qquad (2.86)$$

with $a_i > s_A$. Hence we prefer to approximately compute the second term in (2.84) as $F_{A,B}(a_i, b_s) \approx \left(\frac{s_B - s_A}{s_B - a_i}\right)^m$ since $F_{A,B}(a_i, b_s) < P(A < a_i, B > s_B)$ and $P(A < a_i, B > s_B) > \left(\frac{s_B - s_A}{s_B - a_i}\right)^m$. Thus (2.86) moves into

$$\left(\frac{s_B - s_A}{s_B - a_i}\right)^m = \frac{\delta}{2} \qquad (2.87)$$

Equations (2.85) and (2.87) prove to be linear in $b_s$ and $a_i$, with solutions:

$$a_i = s_B - (s_B - s_A)\frac{1}{(\delta/2)^{1/m}} \qquad (2.88)$$

$$b_s = s_A + (s_B - s_A)\frac{1}{(\delta/2)^{1/m}} \qquad (2.89)$$

Thanks to the box form of the confidence region we may draw separately the course of the confidence interval for $A$, whose upper bound is $s_A$ and whose lower bound is $a_i$ as in (2.88), and for $B$, whose upper bound is $b_s$ as in (2.89) and whose lower bound is $s_B$. In Fig. 2.26 we drew these regions for $m = 10$ and $\delta = 0.10$. The extremes of the two intervals coincide for $s_A = s_B$, an event of probability 0 for actually continuous $X$.

Fig. 2.25: Relationship between frequencies and probabilities in 60 families of pairs (sample, population), each obtained by sampling from a Bernoulli variable where $p$ rises from 0 to 1 with step 0.01. With reference to equation 1.10, $m = 20$, $M = 5$, $k + K = 10$. Horizontal axis: $K$, vertical axis: frequencies $\phi$ of strings with given $k$ (bullets) and companion values of $P(k; m, K, M)$ (line).


Fig. 2.26: Two-sided confidence intervals for: (a) left parameter $A$ and (b) right parameter $B$ of a continuous uniform distribution law as a function of the statistics $s_A = x_{(1)}$ (on the X axis) and $s_B = x_{(m)}$ (on the Y axis).


With a Gaussian variable things are a bit more complex, yet equally neces- 
sary given the widespread use of this variable. 


Example 2.25. The Gaussian distribution represents another typical instance where two parameters have to be inferred simultaneously from a sample. Returning to Example 2.20, logical implications (2.79) and (2.80) both have the drawback of requiring that the parameter not involved in the implication is fixed. Vice versa, considering that $x_i - \bar{x} = \sigma(z_i - \bar{z})$, and defining $r_\Sigma = \sum_{i=1}^m (x_i - \bar{x})^2$, we may substitute (2.59) with relation

$$(\sigma \le \tilde{\sigma}) \Leftrightarrow (r_\Sigma \le \tilde{r}_\Sigma) \qquad (2.90)$$

which holds for any value of $\mu$ and still pivots around a sufficient statistic but with a pledge point (see Example 2.15). Namely from (2.90) we have

$$F_{\Sigma^2}(\sigma^2) = 1 - F_{R_\Sigma}(r_\Sigma) = 1 - F_{\chi^2_{m-1}}\left(\frac{r_\Sigma}{\sigma^2}\right) \qquad (2.91)$$

where the last equality comes from the well known result that in the case of $X_i$ Gaussian $(1/\sigma^2)\sum_{i=1}^m (X_i - \bar{X})^2$ follows a Chi square distribution with parameter $m - 1$ [Kendall and Stuart, 1961]. Hence:

$$f_{\Sigma^2}(\sigma^2) = f_{\chi^2_{m-1}}\left(\frac{r_\Sigma}{\sigma^2}\right)\frac{r_\Sigma}{\sigma^4} = \frac{1}{\Gamma((m-1)/2)}\left(\frac{1}{2}\right)^{(m-1)/2}\left(\frac{r_\Sigma}{\sigma^2}\right)^{(m-1)/2-1} e^{-\frac{r_\Sigma}{2\sigma^2}}\,\frac{r_\Sigma}{\sigma^4}\, I_{(0,+\infty)}(\sigma^2) \qquad (2.92)$$


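Then, using conditional distributions as in Example 2.20, the c.d.f. of the vector $(M, \Sigma^2)$ is computed as

$$F_{M,\Sigma^2}(\mu, \sigma^2) = \int_0^{\sigma^2} \frac{1}{\Gamma((m-1)/2)}\left(\frac{1}{2}\right)^{(m-1)/2}\left(\frac{r_\Sigma}{\varsigma^2}\right)^{(m-1)/2-1} e^{-\frac{r_\Sigma}{2\varsigma^2}}\,\frac{r_\Sigma}{\varsigma^4}\, F_Z\left(\frac{\mu - s_M/m}{\sqrt{\varsigma^2/m}}\right) d\varsigma^2 \qquad (2.93)$$

where $Z$ is a Normal variable. Analogously, either computing the second derivative of $F_{M,\Sigma^2}(\mu, \sigma^2)$, according to (2.93), or directly combining unconditional and conditional distributions, we obtain

$$f_{M,\Sigma^2}(\mu, \sigma^2) = \frac{1}{\Gamma((m-1)/2)}\left(\frac{1}{2}\right)^{(m-1)/2}\left(\frac{r_\Sigma}{\sigma^2}\right)^{(m-1)/2-1} e^{-\frac{r_\Sigma}{2\sigma^2}}\,\frac{r_\Sigma}{\sigma^4}\,\frac{1}{\sqrt{2\pi\sigma^2/m}}\, e^{-\frac{(\mu - s_M/m)^2}{2\sigma^2/m}}\, I_{(0,+\infty)}(\sigma^2) \qquad (2.94)$$

By marginalizing (2.94) (i.e. integrating) either with respect to $\sigma^2$ or with respect to $\mu$ we obtain the marginal density function of $M$ and $\Sigma^2$, respectively. The latter will coincide with (2.92), of course, while the former is:

$$f_M(\mu) = \frac{\Gamma(m/2)}{\Gamma((m-1)/2)}\sqrt{\frac{m}{\pi r_\Sigma}}\left(1 + \frac{m(\mu - s_M/m)^2}{r_\Sigma}\right)^{-m/2} \qquad (2.95)$$

i.e., defining $t = \frac{\mu - s_M/m}{\sqrt{r_\Sigma/(m(m-1))}}$,

$$f_T(t) = \frac{\Gamma(m/2)}{\Gamma((m-1)/2)}\,\frac{1}{\sqrt{\pi(m-1)}}\left(1 + \frac{t^2}{m-1}\right)^{-m/2} \qquad (2.96)$$

representing the density function of the well known Student’s t distribution law [Student, 1908] (see page 347).

Hence, in the case where both $\mu$ and $\sigma$ are unknown, with $m = 20$ a 0.900 two-sided symmetric confidence interval for $M$ is delimited by the extremes $\mu_{0.050} = t_{0.050}\sqrt{r_\Sigma/20} + \bar{x} = -1.780\sqrt{r_\Sigma/20} + \bar{x}$ and $\mu_{0.950} = t_{0.950}\sqrt{r_\Sigma/20} + \bar{x} = 1.780\sqrt{r_\Sigma/20} + \bar{x}$. Thus

$$P\left(-1.780\sqrt{r_\Sigma/20} + \bar{x} < M < 1.780\sqrt{r_\Sigma/20} + \bar{x}\right) = 0.900 \qquad (2.97)$$

Analogously, the one-sided 0.900 confidence interval for $\Sigma^2$ having 0 as the left extreme is bounded on the right by the value $\sigma^2_{0.900} = r_\Sigma/\chi^2_{0.100} = r_\Sigma/11.65$, i.e.

$$P(0 < \Sigma^2 < r_\Sigma/11.65) = 0.900 \qquad (2.98)$$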
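The Chi square quantiles used for the variance intervals can be checked numerically. The sketch below is an assumption-laden illustration, not the procedure used in the text: it evaluates the Chi square c.d.f. through the standard power series of the regularized lower incomplete gamma function and inverts it by bisection. The helper names `chi2_cdf` and `chi2_quantile` are hypothetical.

```python
import math


def chi2_cdf(x, dof):
    """Chi square c.d.f. via the series of the regularized incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, y = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))


def chi2_quantile(p, dof):
    """Value x such that chi2_cdf(x, dof) = p, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, dof) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


q_known_mean = chi2_quantile(0.100, 20)    # parameter m, mean known
q_unknown_mean = chi2_quantile(0.100, 19)  # parameter m - 1, mean unknown
```

The smaller quantile obtained with parameter $m - 1$ is what makes the right extreme of the variance interval larger when the mean is unknown.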


[Margin notes: The joint $(M, \Sigma^2)$. The marginal $M$ involving Student distribution. A trivial confidence region and a smarter one.]


Note how both these intervals are larger than those in Example 2.20. This is a penalty we pay due to the very poor information we have about the $X$ distribution law. In particular the distribution law of $M$ completely changes, while the distribution law of $\Sigma^2$ is still a Chi square based one yet with parameter $m - 1$ in place of $m$.

Finally, if we want a confidence region for the vector $(M, \Sigma^2)$, we have two tools:


1. As in the case of uniform distribution, we may look for a rectangular domain measured through the cumulative function $F_{M,\Sigma^2}$. For instance, wanting no lower bound for $\sigma^2$ except 0 and $\mu$ extremes symmetric around $s_M/m$ (actually the mean value of $M$), we may manage on $\sigma^2_{up}$ and $\mu_{up} - s_M/m = s_M/m - \mu_{dw}$ such that $F_{M,\Sigma^2}(\mu_{up}, \sigma^2_{up}) - F_{M,\Sigma^2}(\mu_{dw}, \sigma^2_{up}) = 0.900$. In this way we obtain the plain line rectangle in Fig. 2.27.


2. While easy to implement, the above solution may prove scarcely informative, since the rectangle we draw contains a great portion of points with very small probability density. Thus, stretching the shape of the confidence domain a bit, we may get a definitely narrower area, hence with less dispersed values among which we may expect to find the parameters with a non-negligible probability. To this aim, in the measure $1 - \delta$ of the confidence region:

$$1 - \delta = \int_0^{\sigma^2_{up}} \int_{\mu_{dw}(\sigma^2)}^{\mu_{up}(\sigma^2)} f_{M,\Sigma^2}(\mu, \sigma^2)\, d\mu\, d\sigma^2 \qquad (2.99)$$

we decide to fix

$$\mu_{dw}(\sigma^2) = (M|\sigma^2)_{(1-\sqrt{1-\delta})/2} \approx (M|\sigma^2)_{\delta/4} \qquad (2.100)$$

$$\mu_{up}(\sigma^2) = (M|\sigma^2)_{1-(1-\sqrt{1-\delta})/2} \approx (M|\sigma^2)_{1-\delta/4} \qquad (2.101)$$

so that $P(\mu_{dw}(\sigma^2) < M < \mu_{up}(\sigma^2)) = \sqrt{1-\delta} \approx 1 - \delta/2$ for each $\sigma^2$. Analogously we fix $\sigma^2_{up} = \sigma^2_{\sqrt{1-\delta}}$, so that $P(\Sigma^2 < \sigma^2_{up}) = \sqrt{1-\delta}$. Substituting these probabilities in the integral in (2.99) we find that the whole region has measure $1 - \delta$. The shape of this region has the parabolic form in Fig. 2.27.


For the sake of completeness, in Fig. 2.27 we also draw with a dashed line the rectangle edges obtained by substituting the marginal distribution of $M$ in place of its conditional distribution to compute $\mu_{up}$ and $\mu_{dw}$ with the second tool.


Fig. 2.27: 0.900 confidence regions for the vector $(M, \Sigma^2)$ of the parameters in a Gaussian distribution. Plain rectangle: region with fixed extremes for both components. Curve: region with $M$ extremes depending on $\sigma^2$. Dashed rectangle: region computed considering the marginal distribution of $M$ in place of the conditional one.


2.3 Point estimators


Faced with a phenomenon like our mailbox, whose future observations we cannot determine as of now, whatever non-trivial operational decision we take about these observations may prove wrong if we expect a certain result. You may decide to spend one hour per day cleaning the mailbox from junk e-mails. But this amount of time, even accumulated over many days, may prove too little or too much, despite all the knowledge we got in Sec. 2.2 about the distribution law of the asymptotic percentage of these mails. A reasonable thing you can do, beyond giving up this communication tool, is to try to minimize the damage coming from the difference between planned and actual percentages of wasted time. This strategy passes through the definition of a loss function and an optimization problem as follows.


Definition 2.11. Given a random variable $X$ with parameter $\Theta$, a sample $x$ and an operational framework $\mathcal{A}$ ³², a loss function $l(\Theta, \hat{\theta})$ w.r.t. a point estimator $\hat{\theta}$ of the value $\theta$ the parameter will assume in the suffix of $x$ is a function computing the loss we incur in giving value $\hat{\theta}$ to $\Theta$ within $\mathcal{A}$. The related risk function $\mathcal{R}(\hat{\theta})$ is the expected value $E[l(\Theta, \hat{\theta})]$ of the loss as a function of $\hat{\theta}$.


When, in a given operational framework, the loss function is well posed and the distribution law of $\Theta$ is manageable, a suitable strategy is to identify a point estimator of $\theta$ with the solution of the minimization of the risk w.r.t. its argument.

Here below we will examine a very general purpose loss function whose risk minimization gives rise to a family of estimators with interesting properties.


2.3.1 Square error 


The most common loss function employed in statistics in the absence of any commitment is the square of the difference between $\hat{\theta}$ and the $\Theta$ specifications, thanks to its symmetry, clear meaning and easy computability. The related risk is the Mean Square Error $\mathrm{MSE}[\Theta, \hat{\theta}]$:

$$\mathrm{MSE}[\Theta, \hat{\theta}] = E[(\Theta - \hat{\theta})^2] \qquad (2.102)$$

Since the risk is a quadratic function of $\hat{\theta}$, its unique minimum occurs where its derivative is 0. Now $E[(\Theta - \hat{\theta})^2] = E[\Theta^2] + \hat{\theta}^2 - 2\hat{\theta}E[\Theta]$, and its derivative $2\hat{\theta} - 2E[\Theta]$ vanishes for $\hat{\theta} = E[\Theta]$. We qualify this solution as an unbiased estimator.

³² Constituting a special compartment of the computational context $C$ (see Definition 1.1).

[Margin notes: You minimize the estimate-parameter distance by estimating the parameter with its mean value. Main ingredient: the minimal sufficient statistic. Sample frequency in place of $p$ for $X$ Bernoullian.]
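The minimization can be illustrated numerically. The following sketch assumes a small, purely hypothetical discrete distribution for the parameter and scans a grid of candidate estimates; the names `risk`, `values` and `probs` are illustrative only.

```python
def risk(theta_hat, values, probs):
    """Mean square error E[(Theta - theta_hat)^2] for a discrete Theta."""
    return sum(p * (v - theta_hat) ** 2 for v, p in zip(values, probs))


# Hypothetical parameter distribution (values and their probabilities).
values = [0.5, 1.0, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

mean = sum(v * p for v, p in zip(values, probs))
grid = [i / 100 for i in range(0, 501)]
best = min(grid, key=lambda t: risk(t, values, probs))
```

On this grid the minimizer of the risk falls on the mean of the parameter distribution, in agreement with the derivative argument.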


Definition 2.12. For any sample $x$ from a random variable with parameter $\Theta$, a statistic $\hat{\theta} = h(x)$ is an unbiased estimator of $\theta$ if and only if

$$\hat{\theta} = E[\Theta] \qquad (2.103)$$


In this case, too, the operational benefit is asymptotically appreciated. If we decide to adopt unbiased estimators for a long series of independent decisions in different $\mathcal{A}$’s (and $\Theta$ distributions as well), each having a finite risk, then we get the minimum cumulative loss. Rather, if the loss is linear with the difference $\hat{\theta} - \theta$ having occurred with possibly different $\theta$ in any $\mathcal{A}$, then our decisions are equivalent to having used the correct $\theta$ in any $\mathcal{A}$.

The shape of the function $\hat{\theta} = h(x)$ depends on the $\Theta$ distribution law. The latter depends, in turn, on the sample through minimal sufficient statistics in the limits of Theorem 2.1. Thus we have the following result:


Theorem 2.2. If $X$ follows a regular distribution law with parameter $\Theta$, an unbiased estimator of it is a function of a minimal sufficient statistic.


Example 2.26. Having a string of $m$ bits as a specification of a sample $X$ of independent Bernoulli variables with parameter $p$, if $k$ is the number of 1’s, then the unbiased estimator of $p$ is gauged by:

$$\frac{k}{m+1} \le \hat{p} \le \frac{k+1}{m+1} \qquad (2.104)$$

Indeed $E[P] = \int_0^1 (1 - F_P(\xi))\,d\xi$ from Definition B.6, so that $E[Z_{k+1,m-k}] \ge E[P] \ge E[Z_{k,m-k+1}]$, where $Z_{k,m-k+1}$ is a Beta variable defined through (2.73). The claim follows from the fact that $\int_0^1 (1 - I_\xi(k, m-k+1))\,d\xi = \frac{k}{m+1}$. With large $k$ and $m$ the two extremes converge to $k/m$. On the contrary, for small $k$ and $m$ we can get a suitable estimator given by

$$\hat{p} = \frac{k+1}{m+2} \qquad (2.105)$$

as the mean value of the frequencies of 1’s in short future sequences, say of length $M$, where the probability of having $K$ 1’s is given by (1.10) and the mean is taken after normalizing these probabilities over $K$ ³³. Namely

$$E[P] = \sum_{\kappa=0}^{M} \frac{\kappa}{M}\,\frac{P(\kappa)}{\sum_{\kappa=0}^{M} P(\kappa)} = \frac{\sum_{\kappa=0}^{M} \frac{\kappa}{M}\binom{k+\kappa}{k}\binom{m-k+M-\kappa}{m-k}}{\sum_{\kappa=0}^{M} \binom{k+\kappa}{k}\binom{m-k+M-\kappa}{m-k}} = \frac{k+1}{m+2} \qquad (2.106)$$


Remark 2.7. We interpret (2.106) as follows: if we did not acquire great experience through the sample, we make predictions on the future by adding two bits to the observed string, where we fairly set one to 1 and one to 0. This is an option that makes the marginal probabilities emerging from the augmented symmetry ensembles add to 1, and therefore the asymptotic frequencies in the future add to one as well. This result, though coinciding with the mean value obtainable through the Bayesian approach for a uniform a priori distribution of $P$, was formerly obtained by Laplace in 1814 through the above motivations, which he explained as a rule of succession [Laplace, 1814].
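The rule of succession can be verified exactly for small strings. The sketch below rests on an assumption: equation (1.10) is not reproduced in this section, so the weight of having $K$ ones in a future string of length $M$ is taken proportional to $\binom{k+K}{k}\binom{m-k+M-K}{m-k}$, which is the reading adopted here. The function name `mean_future_frequency` is hypothetical.

```python
from fractions import Fraction
from math import comb


def mean_future_frequency(k, m, M):
    """Mean frequency of 1's in a future string of length M.

    Assumed weights (not taken from the text verbatim): the probability of
    K ones is proportional to C(k+K, k) * C(m-k+M-K, m-k).
    """
    weights = [comb(k + K, k) * comb(m - k + M - K, m - k)
               for K in range(M + 1)]
    total = sum(weights)
    return sum(Fraction(K, M) * w for K, w in enumerate(weights)) / total


# Hypothetical observed string: k = 3 ones out of m = 7 bits, M = 5 future bits.
p_hat = mean_future_frequency(k=3, m=7, M=5)
```

Under this assumption the mean does not depend on the length $M$ of the future string, which is consistent with reading it as a property of the observed string alone.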


[Margin note: The fairest bit prediction in absence of true information: “one 1 over two bits”.]


Example 2.27. Having a string of $m$ values as a specification of a sample $X$ of independent Gaussian variables with mean $\mu$, we get as its unbiased estimator $\hat{\mu}$:

$$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i = \bar{x} \qquad (2.107)$$

Indeed, from Fact B.3 and (2.81) we know that $Z = \frac{M - \bar{x}}{\sigma/\sqrt{m}}$ follows a Normal distribution law, hence $M$ follows a Gaussian distribution with mean $\bar{x}$. We get the same result even if $\sigma$ is unknown. Indeed, in this case, denoting $s^2 = \frac{1}{m-1}\sum_{i=1}^m (x_i - \bar{x})^2$, $T = \frac{M - \bar{x}}{s/\sqrt{m}}$ follows a Student’s t distribution having mean 0.

With a more direct computation, the unbiased estimator of $\sigma^2$ is:

$$\hat{\sigma}^2 = \begin{cases} \dfrac{\sum_{i=1}^m (x_i - \mu)^2}{m-2} & \text{if } \mu \text{ is known} \\[2ex] \dfrac{\sum_{i=1}^m (x_i - \bar{x})^2}{m-3} & \text{if } \mu \text{ is unknown} \end{cases} \qquad (2.108)$$

since $F_{\Sigma^2}(\sigma^2) = 1 - F_{\chi^2}(s/\sigma^2)$ with parameter $m$ in case $s = \sum_i (x_i - \mu)^2$ and $m - 1$ in case $s = \sum_i (x_i - \bar{x})^2$. Hence

$$E[\Sigma^2] = s\int_0^{+\infty} \frac{1}{\xi}\, f_{\chi^2}(\xi)\, d\xi \qquad (2.109)$$

with the analytical integration coinciding with the second terms of (2.108) depending on the Chi square parameter. Note that from Jensen inequality (see Fact B.11) $E[\Sigma] \le \sqrt{E[\Sigma^2]}$.
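The two denominators of the variance estimator can be checked by simulation. The sketch below assumes that the variance parameter behaves as $s/\chi^2$, with the Chi square variable having parameter $m$ (mean known) or $m - 1$ (mean unknown); the function name and the numeric inputs are hypothetical.

```python
import random


def mean_sigma2(s, dof, n_replicas=100000, seed=0):
    """Monte Carlo mean of s / chi2, with chi2 a Chi square variable."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_replicas):
        # A Chi square variable with `dof` degrees of freedom is a Gamma
        # variable with shape dof / 2 and scale 2.
        chi2 = rng.gammavariate(dof / 2.0, 2.0)
        total += s / chi2
    return total / n_replicas


s, m = 36.0, 20                        # hypothetical statistic and sample size
known_mean = mean_sigma2(s, m)         # theory: s / (m - 2)
unknown_mean = mean_sigma2(s, m - 1)   # theory: s / (m - 3)
```

The simulated means approach $s/(m-2)$ and $s/(m-3)$ because the expected value of the reciprocal of a Chi square variable with parameter $\nu$ is $1/(\nu - 2)$.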


³³ Remember that in the framework where (1.10) was derived we have a different sample space for each $K$.

[Margin notes: Sample mean in place of $\mu$ for $X$ Gaussian. Inverse of the sample mean in place of $\lambda$ for $X$ negative exponential. Unbiased maximum in place of $\theta$ for $X$ uniform in $(0, \theta)$.]


Example 2.28. Consider a string of $m$ values as a specification of a sample $X$ of independent negative exponential variables having parameter $\lambda$. The unbiased estimator of the random variable $\Lambda$ underlying this parameter is

$$\hat{\lambda} = \frac{m}{\sum_{i=1}^m x_i} \qquad (2.110)$$

Indeed, from Example 2.16 we know that $F_\Lambda(\lambda) = F_S(s)$, where $S$ follows a Gamma distribution with parameters $m$ and $\lambda$ and $s = \sum_{i=1}^m x_i$, hence:

$$F_\Lambda(\lambda) = 1 - \sum_{i=0}^{m-1} e^{-\lambda s}\,\frac{(\lambda s)^i}{i!} \qquad (2.111)$$

which reads as the c.d.f. of a Gamma distribution of parameters $m$ and $s$ as well. Hence, as in the Table on page 345, we have

$$E[\Lambda] = \frac{m}{s} = \frac{m}{\sum_{i=1}^m x_i} = \hat{\lambda} \qquad (2.112)$$

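Example 2.29. Consider a string of $m > 2$ values as a specification of a sample $X$ of independent random variables uniformly distributed over $[0, \theta]$. The unbiased estimator of the random variable $\Theta$ underlying this parameter is

$$\hat{\theta} = \frac{m}{m-1}\, x_{(m)} \qquad (2.113)$$

where $x_{(m)} = \max_{i=1,\ldots,m}\{x_i\}$. Indeed, as the joint distribution of the sample values is

$$L(x_1, \ldots, x_m; \theta) = \frac{1}{\theta^m}\prod_{i=1}^m I_{[0,\theta]}(x_i) = \frac{1}{\theta^m}\, I_{[0,+\infty)}(x_{(1)})\, I_{(-\infty,\theta]}(x_{(m)}) \qquad (2.114)$$

according to (2.34) $x_{(m)}$ is a sufficient statistic for $\Theta$. Therefore we can use the following twisting argument

$$(\theta \le \tilde{\theta}) \Leftrightarrow (x_{(m)} \le \tilde{x}_{(m)}) \qquad (2.115)$$

to derive the c.d.f. for $\Theta$ (where $\tilde{x}_{(m)}$ denotes the value the statistic $x_{(m)}$ will assume if the distribution parameter is raised till $\tilde{\theta}$), which reads

$$F_\Theta(\tilde{\theta}) = P(\tilde{X}_{(m)} \ge x_{(m)})\, I_{[x_{(m)},+\infty)}(\tilde{\theta}) = \left(1 - \prod_{i=1}^m P(\tilde{X}_i < x_{(m)})\right) I_{[x_{(m)},+\infty)}(\tilde{\theta}) = \left(1 - \left(\frac{x_{(m)}}{\tilde{\theta}}\right)^m\right) I_{[x_{(m)},+\infty)}(\tilde{\theta}) \qquad (2.116)$$

where the indicator function nullifies the c.d.f. in case $\tilde{\theta} < x_{(m)}$ ³⁴. Thus the expected value of $\Theta$ is

$$E[\Theta] = \int_0^{+\infty} (1 - F_\Theta(\theta))\, d\theta = \int_0^{x_{(m)}} d\theta + \int_{x_{(m)}}^{+\infty} \left(\frac{x_{(m)}}{\theta}\right)^m d\theta = \frac{m}{m-1}\, x_{(m)} \qquad (2.117)$$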


Example 2.30. Continuing Example 2.24 we get the unbiased estimators of both $A$ and $B$ by simply applying the continuous version of (B.85) to the distribution (2.50). Hence:

$$E[A] = \int_{-\infty}^{s_A}\int_{s_B}^{+\infty} a\, f_{A,B}(a,b)\, db\, da \qquad (2.118)$$

$$E[B] = \int_{-\infty}^{s_A}\int_{s_B}^{+\infty} b\, f_{A,B}(a,b)\, db\, da \qquad (2.119)$$

Let us define the binding expressions $f^{dw}_{A,B}(a,b)$, $f^{up}_{A,B}(a,b)$ for $f_{A,B}(a,b)$ by deriving the bounds on $F_{A,B}(a,b)$ in (2.50). We note that for $m \to +\infty$ the values obtained in any of the equations (2.118-2.119) substituting $f$ with the two densities $f^{dw}$ and $f^{up}$ collapse into the same expressions. Namely:

$$\hat{a} = s_A - \frac{s_B - s_A}{m-1} \qquad (2.120)$$

$$\hat{b} = s_B + \frac{s_B - s_A}{m-1} \qquad (2.121)$$

Hence the estimators drift from the extreme order statistics by a quantity that depends not only on $m$, as suggested by conventional estimators of the single extremes (see [Kendall and Stuart, 1961]), but also on the range of the observed values. This matches our intuition very well.


2.3.2 Shortening the procedures 


Minimizing the estimation risk, as in the case of unbiased estimators, may prove a long procedure and generally requires the use of a computer, since it starts with the identification of the parameter distribution law. In order to get a quick evaluation of a point estimator we may frame it within the Kolmogorov approach: “given the distribution law, you must discover its parameters on the basis of a sample drawn from it” (the second horn of the egg-chicken dilemma, see Footnote 24). Here below we report a series of common estimating procedures that represent the quick version of ours, generally having weak forms of the properties discussed before.


³⁴ Indeed, recalling that the related explaining function maps $u$ to $\theta u$, in this case we have necessarily $\tilde{x}_{(m)} = \tilde{\theta} u_{(m)} \le \tilde{\theta} < x_{(m)}$.

[Margin notes: Jointly unbiased statistics if both extremes are unknown. To relax the computations take the expectation on the sample rather than population and get a weak unbiasedness. Not always the shift is disregardable.]


2.3.2.1 Weak unbiasedness 


Within a principled discrete framework, the uniform unbiasedness property of an estimator $T$ of a parameter $\Theta$:

$$E[t - \Theta] = \sum_{\theta \in D_\Theta} (t - \theta)\, P_\Theta(\theta) = 0 \quad \forall t \in D_T \qquad (2.122)$$

implies the ensemble property:

$$\sum_{t \in D_T}\sum_{\theta \in D_\Theta} (t - \theta)\, P_\Theta(\theta) = 0 \qquad (2.123)$$

which reads as a global bias equal to 0. An analogous result, namely

$$\sum_{t \in D_T}\sum_{\theta \in D_\Theta} (t - \theta)\, P_T(t) = \sum_{\theta \in D_\Theta}\sum_{t \in D_T} (t - \theta)\, P_T(t) = 0 \qquad (2.124)$$

is achieved when

$$E[T - \theta] = \sum_{t \in D_T} (t - \theta)\, P_T(t) = 0 \quad \forall \theta \in D_\Theta \qquad (2.125)$$

The latter condition defines an unbiased estimator in the Kolmogorov approach that we denote as a weakly unbiased estimator. In this case we are supposed to observe a sample from a prefixed population with a given $\theta$ and we want an estimator $T$ whose bias with respect to $\theta$ is 0 on average, i.e. $E[T] = \theta$. The two estimator definitions are different. The defined statistics almost coincide either when the sample gets the size of a population or when we deal with distribution laws, like the Gaussian one, and with parameters like the mean, where the mean estimator sums in any case a true population of elementary values.

Working with weak estimators may prove a very robust strategy when our knowledge of the data is extremely poor. The following fact represents the strongest reason for working within the Kolmogorov framework.


Fact 2.3. Consider a random sample $X$: as

$$\frac{1}{m}\sum_{i=1}^m E[X_i] = \mu_X \qquad (2.126)$$

the sample mean $\bar{x}$ is a weakly unbiased estimator for the mean of whatever distribution law ³⁵.


Although the two unbiased estimators of the mean collapse in Example 2.27, this is not always true.


Example 2.31. Consider a random sample $x$ drawn from a negative exponential distribution of parameter $\lambda$ (having thus mean equal to $1/\lambda$). Switching to the random variable $\Lambda$ of which $\lambda$ represents a realization, in order to be able to apply Definition 2.12, we need an analytical form for the density of $\Lambda$. Denoting $s = \sum_{i=1}^m x_i$, this can be obtained from (2.21) as follows

$$f_\Lambda(\lambda) = \frac{dF_\Lambda(\lambda)}{d\lambda} = \sum_{i=0}^{m-1} s\, e^{-\lambda s}\frac{(\lambda s)^i}{i!} - \sum_{i=1}^{m-1} s\, e^{-\lambda s}\frac{(\lambda s)^{i-1}}{(i-1)!} = s\, e^{-\lambda s}\frac{(\lambda s)^{m-1}}{(m-1)!} \qquad (2.127)$$

The unbiased estimator for the mean of an exponential distribution can now be derived as

$$E\left[\frac{1}{\Lambda}\right] = \int_0^{+\infty} \frac{1}{\lambda}\, f_\Lambda(\lambda)\, d\lambda = \int_0^{+\infty} \frac{1}{\lambda}\, s\, e^{-\lambda s}\frac{(\lambda s)^{m-1}}{(m-1)!}\, d\lambda = \frac{s}{(m-1)!}\int_0^{+\infty} u^{m-2} e^{-u}\, du = \frac{s\,(m-2)!}{(m-1)!} = \frac{1}{m-1}\sum_{i=1}^m x_i \qquad (2.128)$$

and therefore differs from the sample mean.

³⁵ We should say rather that $\bar{x}$ is a specification of the estimator $\bar{X}$, whose random behavior qualifies its estimating capability in the common statistical approaches.
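A quick simulation makes the gap between the two estimators visible. The sketch below assumes, as in the example, that the parameter follows a Gamma distribution with shape $m$ and rate $s$; the function name and the numeric inputs are hypothetical.

```python
import random


def mean_of_inverse_lambda(s, m, n_replicas=200000, seed=0):
    """Monte Carlo estimate of E[1 / Lambda], Lambda ~ Gamma(shape m, rate s)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_replicas):
        # random.gammavariate takes shape and scale, so the scale is 1 / s.
        lam = rng.gammavariate(m, 1.0 / s)
        total += 1.0 / lam
    return total / n_replicas


s, m = 25.0, 10                 # hypothetical sum of the sample and its size
unbiased = mean_of_inverse_lambda(s, m)   # theory: s / (m - 1)
sample_mean = s / m                       # weakly unbiased estimator
```

The simulated expectation settles near $s/(m-1)$, slightly above the sample mean $s/m$, and the two get closer as $m$ grows.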


Per se, unbiased estimators in the Kolmogorov approach prove useful and are widely employed in an a posteriori framework where the theory of hypothesis testing is developed. The scenario is the following: given a population described by a random variable $X$ of fixed though unknown parameter $\theta$, we assume $\theta = \theta_0$ and check this hypothesis on a sample $x_m$ picked at random within the population. As $x_m$ is a specification of the random sample $X_m$, we search for statistics $s = h(x_m)$ that prove meaningful in spite of the variability of $X_m$, with respect to which the observed sample is just a partial inspection. Properties concerning the expected value of $S = h(X_m)$ or functions of it might denote the significance of the statistic and the value of the hypothesis.

[Margin note: Test of hypothesis.]


Example 2.32. A quality control problem typical of the sixties has the following format: you acquire a huge amount of items of a given kind, e.g. ball bearings (call it a lot), and want to check whether the quality of these items is satisfactory. For instance the manufacturer tells you they have been produced with a variance $\sigma^2_D$ of internal diameter $D$ less than 0.01 mm². Call the manufacturer’s assertion null hypothesis $H_0$. You pick at random a sample from the lot, say 10 ball bearings, and carefully measure their internal diameters $d_i$, and you want to base your verdict “$H_0$ is true/$H_0$ is false” on a condition that is satisfied or not by a relevant property of the sample $\{d_1, \ldots, d_{10}\}$. Call this condition a test of hypothesis. Assuming, as commonly assessed, that $D$ follows a Gaussian distribution law, you have no doubt that a significant property w.r.t. $\sigma^2_D$ is the minimal sufficient statistic consisting of the sum $s = \sum_{i=1}^{10} (d_i - \bar{d})^2$ of the squared deviations of the sample items. Now the strategy is: consider a condition on $s$ such that it is satisfied with high probability by $S$ when $H_0$ is true. In this case you will claim that $H_0$ is false (reject $H_0$, with proper terminology) with low probability when $H_0$ is true. For instance, our decision rule may be

$$H_0 \text{ is rejected if and only if } s > t_\alpha \qquad (2.129)$$

where $t_\alpha$ is the $(1-\alpha)$ quantile of the $S$ distribution when $\sigma^2_D = 0.01$, so that

$$P(S \le t_\alpha \mid H_0 \text{ is true}) \ge 1 - \alpha \qquad (2.130)$$

Parameter $\alpha$ is denoted as the significance level of this test, representing the probability (risk) of emitting a wrong verdict when $H_0$ is true. Its value, which we would like to be as small as possible, has to be balanced by a complementary risk of accepting $H_0$ when this hypothesis is false, for instance when $\sigma^2_D = 0.1$. In particular, since $S/\sigma^2_D$ follows a Chi square distribution, here with parameter $m - 1 = 9$, with $\alpha = 0.05$ and $\sigma^2_D = 0.01$ we have that the $(1-\alpha)$ quantile of this distribution equals 16.9, and $t_{0.05} = 16.9 \cdot 0.01 = 0.169$, so that the test becomes

$$\text{reject } H_0 \Leftrightarrow s > 0.169 \qquad (2.131)$$

With analogous computations we have that the probability $\beta$ of incorrectly accepting when $\sigma^2_D = 0.1$ is 0.0045.
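The two risks of this test can be recomputed with a short script. This is a sketch under stated assumptions: the Chi square c.d.f. is evaluated through the standard series of the regularized lower incomplete gamma function, the Chi square parameter is taken as 9 (sample of 10 items, mean unknown), and the names `chi2_cdf` and `reject_h0` are hypothetical.

```python
import math


def chi2_cdf(x, dof):
    """Chi square c.d.f. via the series of the regularized incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, y = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(-y + a * math.log(y) - math.lgamma(a))


sigma2_h0, sigma2_alt, dof = 0.01, 0.1, 9
threshold = 16.9 * sigma2_h0                        # t_alpha as in (2.131)
alpha = 1.0 - chi2_cdf(threshold / sigma2_h0, dof)  # rejecting a true H0
beta = chi2_cdf(threshold / sigma2_alt, dof)        # accepting H0 when false


def reject_h0(diameters):
    """Decision rule (2.131) applied to the measured internal diameters."""
    mean = sum(diameters) / len(diameters)
    s = sum((d - mean) ** 2 for d in diameters)
    return s > threshold
```

Raising the threshold lowers the risk of rejecting a true hypothesis but raises the risk of accepting a false one, which is the balance discussed in the example.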


In any case, estimators $\hat{\theta}$ like $\bar{x}$ benefit better from the linearity of the expectation operator. Indeed, for whatever $g(X)$, the weakly unbiased estimator $\hat{g}$ of $E[g(X)]$ is $\frac{1}{m}\sum_{i=1}^m g(x_i)$:

$$E[\hat{G}] = E\left[\frac{1}{m}\sum_{i=1}^m g(X_i)\right] = \frac{1}{m}\sum_{i=1}^m E[g(X_i)] = \frac{m}{m}\, E[g(X)] = E[g(X)] \qquad (2.132)$$

The operational counterpart of this relation lies in (1.38), reinforced by the fact that the variance of $\hat{G}$ decreases with $m$, getting 0 with $m$ going to infinity. Indeed

$$V[\hat{G}] = \frac{1}{m^2}\, V\left[\sum_{i=1}^m g(X_i)\right] = \frac{1}{m}\, V[g(X)] \qquad (2.133)$$

where the last equality comes from the independence of the $g(X_i)$’s, which follows from the independence of the $X_i$’s (see Fact B.1). On the one hand, this asymptotic property is shared also by our strong version of unbiased estimator (as for infinite samples, the two estimators coincide). On the other, minimality of the mean square error (2.102) is not guaranteed by the weak definition, even if the expectation is now taken over $\hat{G}$. Only the restricted class of UMVUEs (Uniformly Minimum Variance Unbiased Estimators) may be proved to enjoy this property [Rao, 1949].


2.3.2.2 The method of moments 


An initial suggestion for finding estimators comes from the discussed contiguity of $\hat{g}$ with $E[g(X)]$. It consists of substituting in the analytical expression of expectation: 1) the integration or summation domain with set($x$), i.e. the set of all different elements of $x$, and 2) the probability density or probability function with the frequencies of the set($x$) items in the sample. This is the root of the widespread method of moments for finding sufficiently good estimators of parameters, starting from limited information about the distribution law they apply to.

Going back to the moments’ definitions (1.39) and (1.40) in Chapter 1, in force of (2.132) and expressing the dependence of the population moments on its parameters, we have that

$$E[M_r] = \mu_r(\theta) \qquad (2.134)$$

where we expressly highlight the dependency of the moments on a parameter $\theta$. Theorem 1.1 states the relevance of these parameters for specifying a distribution law. All this suggests exchanging frequencies with probabilities and taking note of this by referring to an approximation (estimator) $\hat{\theta}$ of $\theta$:

$$m_r = \mu_r(\hat{\theta}) \qquad (2.135)$$

Thus we take as many independent equations as there are components in $\theta$ and get their estimates as a solution of these equations.


Example 2.33. Consider a sample $x$ drawn from an unknown distribution of mean $\mu$ and standard deviation $\sigma$. From the definition of the first two moments we get

$$\mu_1(\mu, \sigma) = \mu \qquad (2.136)$$

$$\mu_2(\mu, \sigma) = \sigma^2 + \mu^2 \qquad (2.137)$$

Denoting $m_1$ and $m_2$ the first two sample moments of $x$ and coupling the two instances of (2.135) corresponding to $r = 1$ and $r = 2$ we obtain the system of equations

$$m_1 = \hat{\mu} \qquad (2.138)$$

$$m_2 = \hat{\sigma}^2 + \hat{\mu}^2 \qquad (2.139)$$

whose solution gives $m_1 = \bar{x}$ as estimator for $\mu$ and $\sqrt{m_2 - m_1^2}$, i.e. $\sqrt{\overline{x^2} - \bar{x}^2}$, as estimator for $\sigma$.


The method looks very simple in principle and less demanding in terms of knowledge about the underlying distribution laws. This comes at the cost of no guarantee of any property, not even weak unbiasedness, for the estimator.


Example 2.34. While we learn from Example 2.33 that for every distribution law the variance estimator $\hat{\sigma}^2$ obtained with the moments method is

$$\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \bar{x})^2 \qquad (2.140)$$

it is easy to prove that $E[\hat{\Sigma}^2] = \frac{m-1}{m}\sigma^2$. Therefore, the weakly unbiased estimator of $\sigma^2$ is

$$\hat{\sigma}'^2 = \frac{1}{m-1}\sum_{i=1}^m (x_i - \bar{x})^2 \qquad (2.141)$$

[Margin notes: The method of moments broadly equates sample and population moments. An easy way of finding non guaranteed but generally robust estimators. A not very discriminating method. Mixtures of populations. But quite an available method.]
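The bias factor $(m-1)/m$ can be checked exactly on a small case. The sketch below enumerates every Bernoulli sample of a given size with exact rational arithmetic; the choice of a Bernoulli variable, the function name and the numeric inputs are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def expected_moment_variance(p, m):
    """Exact expectation of (1/m) * sum((x_i - mean)^2) over Bernoulli samples."""
    total = Fraction(0)
    for sample in product((0, 1), repeat=m):
        prob = Fraction(1)
        for x in sample:
            prob *= p if x else 1 - p
        mean = Fraction(sum(sample), m)
        estimate = sum((x - mean) ** 2 for x in sample) / m
        total += prob * estimate
    return total


p, m = Fraction(3, 10), 5            # hypothetical parameter and sample size
expected = expected_moment_variance(p, m)
true_variance = p * (1 - p)
```

Multiplying the moments estimator by $m/(m-1)$ removes the bias, which is exactly the passage from (2.140) to (2.141).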


Example 2.35. Consider a sample drawn from a random variable $X$ distributed uniformly over the set $[0, \theta]$. As $E[X] = \frac{\theta}{2}$ (see Appendix B), applying the method of moments in order to obtain an estimator for the mean of $X$ we get $\frac{\hat{\theta}}{2} = \bar{x}$, so that $\hat{\theta} = 2\bar{x}$ emerges as estimator for $\theta$. This estimator is different from $\frac{m}{m-1}x_{(m)}$, which in Example 2.29 was shown to be unbiased. Both kinds of unbiasedness are however guaranteed asymptotically, which makes this estimator coincide with our unbiased estimator asymptotically.


Actually the method suffers from the poorness of the information it uses. 


Example 2.36. Consider two independent random variables X and Y, and another
pair of variables derived from the former by mixing either their values or
their density functions. Namely

$$Z = aX + (1-a)Y \qquad (2.142)$$
$$f_W(w) = a f_X(w) + (1-a) f_Y(w) \qquad (2.143)$$

The first moment and the second centered moment of them are:

$$E[Z] = aE[X] + (1-a)E[Y] \qquad (2.144)$$
$$V[Z] = a^2 V[X] + (1-a)^2 V[Y] \qquad (2.145)$$
$$E[W] = aE[X] + (1-a)E[Y] \qquad (2.146)$$
$$V[W] = aV[X] + (1-a)V[Y] + a(1-a)(E[X] - E[Y])^2 \qquad (2.147)$$

Hence from the equality $m_1 = \mu_1(\theta)$ we obtain

$$\hat{a} = \frac{m_1 - E[Y]}{E[X] - E[Y]} \qquad (2.148)$$

i.e. the same estimator independently of the true model underlying the data.
However, we may check the model by controlling whether $m_2 - m_1^2$ is closer to
$a^2 V[X] + (1-a)^2 V[Y]$ or to $aV[X] + (1-a)V[Y] + a(1-a)(E[X] - E[Y])^2$.
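The check suggested at the end of the example can be sketched numerically. The code below is a hedged illustration, not the authors' procedure: it assumes that the means and variances of X and Y are known, uses hypothetical Gaussian components and a hypothetical a = 0.3, and simply compares the sample variance with the two candidate expressions (2.145) and (2.147).

```python
import random

def estimate_mixing(sample, ex, ey, vx, vy):
    """Return (a_hat, model) from the first two sample moments."""
    m = len(sample)
    m1 = sum(sample) / m
    m2 = sum(v * v for v in sample) / m
    a_hat = (m1 - ey) / (ex - ey)                       # estimator (2.148)
    var_values = a_hat ** 2 * vx + (1 - a_hat) ** 2 * vy          # (2.145)
    var_densities = (a_hat * vx + (1 - a_hat) * vy
                     + a_hat * (1 - a_hat) * (ex - ey) ** 2)      # (2.147)
    s2 = m2 - m1 ** 2
    closer_to_values = abs(s2 - var_values) < abs(s2 - var_densities)
    return a_hat, ("values" if closer_to_values else "densities")

# Hypothetical setting: X ~ N(0, 1), Y ~ N(4, 1), a = 0.3
rng = random.Random(0)
a, ex, ey, vx, vy = 0.3, 0.0, 4.0, 1.0, 1.0
z = [a * rng.gauss(ex, 1) + (1 - a) * rng.gauss(ey, 1) for _ in range(5000)]
w = [rng.gauss(ex, 1) if rng.random() < a else rng.gauss(ey, 1)
     for _ in range(5000)]

a_z, model_z = estimate_mixing(z, ex, ey, vx, vy)
a_w, model_w = estimate_mixing(w, ex, ey, vx, vy)
print(a_z, model_z)
print(a_w, model_w)
```

Both samples lead to nearly the same estimate of a, as the example states; only the second moment separates the two models.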


Thanks to its roughness however, the method represents a robust tool where 
more sophisticated ones fail. 




Example 2.37. Consider a Binomial distribution with parameters n and p. As
seen before, we have no sufficient statistic for n. We may however find an
estimator of it from the simple relations:

$$m_1 = np; \qquad m_2 - m_1^2 = np(1-p) \qquad (2.149)$$
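Solving the two relations in (2.149) gives $\hat{p} = 1 - (m_2 - m_1^2)/m_1$ and $\hat{n} = m_1/\hat{p}$. A minimal sketch of this computation follows; the function name and the small data set are hypothetical, and the guard reflects the fact that the equations have no admissible solution when the sample variance is not smaller than the sample mean.

```python
def binomial_moment_estimates(sample):
    """Solve m1 = n p and m2 - m1^2 = n p (1 - p) for (n, p), as in (2.149)."""
    m = len(sample)
    m1 = sum(sample) / m
    m2 = sum(x * x for x in sample) / m
    s2 = m2 - m1 ** 2
    if not 0 < s2 < m1:
        raise ValueError("moment equations have no solution with 0 < p < 1")
    p_hat = 1 - s2 / m1
    n_hat = m1 / p_hat        # generally not an integer
    return n_hat, p_hat

sample = [2, 4, 3, 5, 1, 3, 2, 4]   # hypothetical observations
n_hat, p_hat = binomial_moment_estimates(sample)
print(n_hat, p_hat)
```

Since nothing forces the result to be an integer, in practice one would still have to round $\hat{n}$, which is one more sign of the roughness of the method.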


The reader will recall that we referred to sample moments as usual pivot 
statistics for bootstrapping parameters’ distributions. The method we explained 
in this section represents a quicker and approximate way for exploiting the mo- 
ments’ information content when the final goal is just to have point estimators. 


2.3.2.3 The maximum likelihood method 


This method is considered one of the most sophisticated and informative meth- 
ods for getting estimators. Its root is in the concept of entropy of a random 
variable. 


Entropic measures The entropy $H[X]$ of the probability distribution $P_X$
underlying a random variable X is a nice way of appreciating the uncertainty by
which X is affected, as its expression is the following:

$$H[X] = E[-\ln(P_X(X))] = \sum_{x \in D_X} -\ln(P_X(x))\,P_X(x) \qquad (2.150)$$


and analogously for the continuous approximation. We can easily prove that 
H[X] is maximum for X uniformly distributed, going to infinity with the number 
of values X can assume, while it vanishes when this number is exactly one, 
i.e. the variable is non random. A graduation of intermediate values, always 
positive, of H measures the uncertainty about X far from these extreme cases. 
As a further exclusive property, the entropy of the joint distribution of two 
independent variables is the sum of the entropies of the marginal distributions. 
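These properties are easy to verify on finite distributions. The following sketch uses hypothetical probability vectors of our choosing; it computes (2.150) and compares the uniform, a skewed and a degenerate case, plus the joint distribution of two independent variables.

```python
import math

def entropy(probs):
    """Entropy (2.150) of a finite distribution, in natural units."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform4 = [0.25] * 4              # hypothetical distributions
skewed = [0.7, 0.1, 0.1, 0.1]
degenerate = [1.0]

# Joint distribution of two independent variables: product of the marginals
joint = [p * q for p in skewed for q in uniform4]

print(entropy(uniform4), math.log(4))
print(entropy(skewed))
print(entropy(degenerate))
print(entropy(joint), entropy(skewed) + entropy(uniform4))
```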

This measure is a basic parameter in the Kolmogorov approach. A gen- 
eral thread in modeling is the maximization of the entropy under constraints 
representing what we know about the probability distribution in a probability 
model. For instance, the uniform distribution is the one maximizing the entropy 
within the family of all distributions with finite range; the negative exponential 
distribution gets the same target when one of the extremes of the range goes to 
infinity, while the Gaussian distribution does the same when both extremes are 
relaxed but the variance is bounded. Vice versa we appreciate the efficiency of 
a statistic in terms of its ability to reduce the entropy of the distribution
conditioned by the statistic. In this sense a sufficient statistic w.r.t. a
parameter $\Theta$ is the most efficient, after Fact 2.1, for whatever distribution
law of $\Theta$ ³⁶.


36 From (2.34) we have that $L(x_1,\dots,x_m;\theta \mid s)$, i.e. the mentioned conditional
distribution, exactly equals $\eta(x_1,\dots,x_m)$, that is a function of the underlying
$\{u_1,\dots,u_m\}$ free of any tunable parameter. Thus on the one hand we have no
source of uncertainty, on the other no tool for further reducing its entropy.


A very suitable 
measure of 
uncertainty in the 
Kolmogorov 
approach 


that we
appreciate with
respect to what
we know from
the sample.


In search of the
minimal
uncertainty


we look for the 
sample maximum 
likelihood. 


We generally 
observe what is 
highly probable 


i.e. we search for 
parameter 
estimates taking 
probability 
maximum. 




The relative entropy $I[X; x_1,\dots,x_m]$ is a way of appreciating the gap in
uncertainty between what we know about the random variable from its sample
$\{x_1,\dots,x_m\}$ and what will be revealed by the future population X. Its
definition is the following:

$$I[X; x_1,\dots,x_m] = E[\ln(P_X(X)) - \ln(\Pi(X;\hat{\theta}))] \qquad (2.151)$$

having a weakly unbiased estimator:

$$\hat{I}[X; x_1,\dots,x_m] = \frac{1}{m}\sum_{i=1}^{m}\ln\left(\frac{P_X(x_i)}{\Pi(x_i;\hat{\theta})}\right) \qquad (2.152)$$

where $\Pi(x;\hat{\theta})$ is an estimate of $P_X(x)$ deriving from an estimate $\hat{\theta}$ of a
parameter of the X distribution law.


The operational rule The relative entropy, too, ranges from a maximum
when $\Theta$ and X are independent to 0 when the mentioned uncertainty gap is 0,
i.e. when $\Pi(x_i;\hat{\theta}) = P_X(x_i)$ for each $i \in \{1,\dots,m\}$. Therefore a suitable goal
is to minimize this quantity in order to estimate $\theta$ ³⁷. Moreover, since

$$m\,\hat{I}[X; x_1,\dots,x_m] = \sum_{i=1}^{m}\ln(P_X(x_i)) - \sum_{i=1}^{m}\ln(\Pi(x_i;\hat{\theta})) \qquad (2.153)$$

the goal reads as the maximization of the likelihood $L(x;\theta)$ of the sample:

$$L(x;\theta) = \prod_{i=1}^{m}\Pi(x_i;\theta) \qquad (2.154)$$

Since the likelihood is a measure of the probability of observing exactly the
sample specification in argument (or the related density in case of continuous
variables), the above goal complies with the following intuitive reasoning.

Interpreting the probability of an event as its tendency to occur, we expect
to observe an event because it has high probability. Hence it is reasonable to
assume that free parameters of the random phenomenon expressing the event
assume values that render this probability very high, maximal by default. Thus:


Definition 2.13. Denote likelihood $L(x;\theta)$ of a sample the product

$$L(x;\theta) = \prod_{i=1}^{m} f_X(x_i;\theta) \qquad (2.155)$$

namely the probability of observing exactly the sample in case of discrete X,
or the probability density of this event, in case of continuous X. A maximum
likelihood estimator (MLE) $\hat{\theta}$ of parameter $\theta$ of the distribution law of X is the
solution of the equation:

$$\hat{\theta} = \arg\max_{\theta} L(x;\theta) \qquad (2.156)$$

where $\arg\max_x g(x)$ denotes the argument $x^*$ maximizing g.


37 Consider that we are in the Kolmogorov framework, hence with a fixed $\theta$, even
if from time to time we relate our reasoning to a variable $\Theta$.


Example 2.38. For a binomial random variable X the estimates of its parameters
p and n are obtained as follows. The likelihood of sample x is:

$$L(x_1,\dots,x_m; p, n) = \prod_{i=1}^{m}\binom{n}{x_i}\; p^{\sum_{i=1}^{m} x_i}\,(1-p)^{mn-\sum_{i=1}^{m} x_i} \qquad (2.157)$$

therefore

• to maximize L w.r.t. p we find the zero of its derivative or, more simply,
the zero of the derivative of its logarithm (as the logarithm is a monotone
function), obtaining

$$\frac{\partial \ln L}{\partial p} = \frac{\sum_{i=1}^{m} x_i}{p} - \frac{mn - \sum_{i=1}^{m} x_i}{1-p} = 0 \qquad (2.158)$$

that vanishes for

$$\hat{p} = \frac{1}{mn}\sum_{i=1}^{m} x_i \qquad (2.159)$$

• to maximize L w.r.t. n we consider the course of the ratio
$L(x_1,\dots,x_m; p, n)/L(x_1,\dots,x_m; p, n-1)$ and identify where it crosses
1 provided its monotonicity. Namely:

$$\frac{L(x_1,\dots,x_m; p, n)}{L(x_1,\dots,x_m; p, n-1)} = \frac{n^m (1-p)^m}{\prod_{i=1}^{m}(n - x_i)} \qquad (2.160)$$

that is a function decreasing with n, crossing 1, however, at a value $\hat{n}$
representing a non explicit statistic of x.
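Although $\hat{n}$ has no closed form, it is easy to locate numerically. The sketch below is an illustration under the assumption that p is known when searching for n (and n is known when computing $\hat{p}$); the sample and the value p = 0.4 are hypothetical. It walks n upward while the ratio (2.160) stays at or above 1.

```python
import math

def likelihood_ratio(sample, p, n):
    """L(x; p, n) / L(x; p, n-1) as in (2.160); needs n > max(sample)."""
    m = len(sample)
    return (n * (1 - p)) ** m / math.prod(n - x for x in sample)

def mle_n(sample, p):
    """Largest n at which the ratio is still >= 1 (p assumed known)."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    n = max(sample)           # the likelihood is zero below max(sample)
    while likelihood_ratio(sample, p, n + 1) >= 1:
        n += 1
    return n

def mle_p(sample, n):
    """Closed form (2.159): p = sum(x_i) / (m n), with n assumed known."""
    return sum(sample) / (len(sample) * n)

def log_likelihood(sample, p, n):
    """Logarithm of (2.157), used here only as a brute-force cross-check."""
    return sum(math.log(math.comb(n, x)) + x * math.log(p)
               + (n - x) * math.log(1 - p) for x in sample)

sample = [3, 5, 4, 6, 2]      # hypothetical observations
n_hat = mle_n(sample, p=0.4)
p_hat = mle_p(sample, n_hat)
print(n_hat, p_hat)
```

Because the ratio is decreasing in n, the first value at which it drops below 1 marks the end of the search, so no exhaustive scan of the likelihood is needed.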


MLE is the favorite tool used by statisticians to explore complex models. 


Example 2.39. Assume that the sampling mechanism of Y as a function of x is
the following:

$$Y = a + bx + cZ \qquad (2.161)$$

where Z follows a Normal distribution (see Example 2.15). Let us estimate a, b
and c with the maximum likelihood method. As Y is a Gaussian variable of
parameters $(a + bx, c^2)$, the likelihood of a sample $\{y_1,\dots,y_m\}$ is

$$L(y_1,\dots,y_m; a, b, c) = \left(\frac{1}{\sqrt{2\pi}\,c}\right)^{m} e^{-\frac{1}{2c^2}\sum_{i=1}^{m}(y_i - a - bx_i)^2} \qquad (2.162)$$


Looking for 
vanishing 
derivatives 


or for unitary 
incremental 
ratios. 


An elementary 
linear model 




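We identify the MLEs as the zeroes of the derivatives of $\ln L$ w.r.t. a, b and c:

$$\hat{a} = \bar{y} - \hat{b}\bar{x} \qquad (2.163)$$
$$\hat{b} = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2} \qquad (2.164)$$
$$\hat{c}^2 = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{a} - \hat{b}x_i)^2 \qquad (2.165)$$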
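The three estimators translate directly into a few lines of code. The sketch below applies (2.163)-(2.165) to a small hypothetical data set; the function name is ours and no library beyond the standard one is assumed.

```python
def linear_gaussian_mle(xs, ys):
    """MLEs of a, b and c^2 for Y = a + b x + c Z, following (2.163)-(2.165)."""
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    # Note the division by m, not m - 2: this is the MLE, not the unbiased version
    c2_hat = sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(xs, ys)) / m
    return a_hat, b_hat, c2_hat

xs = [0, 1, 2, 3, 4]                  # hypothetical data
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a_hat, b_hat, c2_hat = linear_gaussian_mle(xs, ys)
print(a_hat, b_hat, c2_hat)
```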


MLE, too, is rooted in sufficient statistics. Indeed (2.34) shows that the sole
part of the likelihood susceptible to be maximized is the term $\tau(s,\theta)$ and this
may be done exactly with a $\hat{\theta}$ as a function of s. Therefore:

Still a sufficient
statistic as main
ingredient

Lemma 2.5. For a random variable with parameter $\theta$, a MLE $\hat{\theta}$ is a function
of a sufficient statistic.


Remark 2.8. The above lemma resolves in a trivial sentence when the sole 
sufficient statistic available for a given parameter is exactly the set of joint 
sufficient statistics constituted by the values of the sample (see Fact 2.2). 


Tight relations
between MLE

Fact 2.4. We may easily realize that the MLE of a parameter represents the
mode of its distribution law when a sufficient statistic is used in the twisting
argument to discover this distribution and the order relations on the statistic
and the parameter coincide. Sharply, starting from the c.d.f. $F_S$ of the statistic
S w.r.t. the parameter $\Theta$, we search the maximum on $\theta$ of the s derivative of
F to get its MLE, thanks to the factorization lemma, and the maximum on $\theta$
of the $\theta$ derivative of F to get its mode, where the two maxima coincide if the
monotony relation in (2.30) is based on the same order relation for both s and
$\theta$.


Plug in method Finding maximum likelihood estimators may prove a quick
task since the maximum of $L(x_1,\dots,x_m;\theta)$ is independent of the representation
of the free parameters on which $f_X$ depends. Thus, a maximization solution
$\hat{\theta}$ w.r.t. $\theta \in \Theta$ maps to $q(\hat{\theta})$ as a solution of the analogous problem w.r.t.
$\theta' \in \{q(\theta); \theta \in \Theta\}$. Formally:

A maximum
likelihood
invariance
property

Lemma 2.6. For a random variable with parameter $\theta$, if $\hat{\theta}$ is a maximum
likelihood estimator of $\theta$, then for any function q we have $q(\hat{\theta})$ as a maximum
likelihood estimator of $q(\theta)$.


Therefore a way of finding maximum likelihood estimators is through com- 
position starting with more elementary ones. In some sense, we plug estimators 
of elementary parameters into the function computing a more complex function 
of these parameters. 




Example 2.40. On the basis of a sample $\{t_1,\dots,t_m\}$, a maximum likelihood
estimator of the reliability R(t) of a plant at time t, i.e. the probability that the
plant goes out of work at a time greater than t, in case the failure time follows
a negative exponential distribution law of parameter $\lambda$ is

$$\hat{R}(t) = \exp\left(-\frac{m\,t}{\sum_{i=1}^{m} t_i}\right) \qquad (2.166)$$

Indeed, $R(t) = P(T > t) = 1 - F_T(t) = e^{-\lambda t}$ and $\hat{\lambda} = m/\sum_{i=1}^{m} t_i$.
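As a minimal sketch of the plug in step, the code below first computes the MLE of $\lambda$ and then inserts it into $e^{-\lambda t}$. The failure times are hypothetical figures chosen only for illustration.

```python
import math

def reliability_mle(failure_times, t):
    """Plug the MLE of lambda (m / sum of the times) into R(t) = exp(-lambda t)."""
    lam_hat = len(failure_times) / sum(failure_times)
    return math.exp(-lam_hat * t)

failure_times = [120, 80, 200, 100]   # hypothetical failure times (hours)
r_hat = reliability_mle(failure_times, t=125)
print(r_hat)
```

Here t equals the sample mean of the failure times, so the estimated reliability is $e^{-1}$ up to rounding.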


The situation is different when we consider joint statistics. Consider the
above suggestion of substituting probabilities with sample frequencies that is at
the basis of the method of moments. Actually, the single probability $P_X(x)$ has
a maximum likelihood estimator in the frequency $\Phi_X(x)$ of meeting x within a
sample. Thus the plug in method would also agree with this suggestion since
$E[g(X)] = q(\theta_1,\dots,\theta_N)$, where N equals the possibly infinite cardinality of $D_X$,
and $\theta_i = P(X = x_i)$ which we substitute with $\Phi_X(x_i)$. Joint statistics however
may suffer from some constraint in search of the maximum of $L(x_1,\dots,x_m;\theta)$.
Or, from another perspective they may miss information coming from the shape
of the distribution law in searching its density maximization on the sample. This
happens in the following examples.


Example 2.41. With reference to a random variable X uniformly distributed in
$[0,\theta]$, and a sample x, the maximum likelihood estimator of the parameter is
$\hat{\theta} = \max\{x_1,\dots,x_m\}$. Indeed its likelihood may be factorized as:

$$L(x_1,\dots,x_m;\theta) = \frac{1}{\theta^m}\, I_{[0,\theta]}(\hat{\theta}) \qquad (2.167)$$

The estimator $2\sum_{i=1}^{m} x_i/m$ suggested by a naive implementation of the plug in
method:

$$E[X] = \frac{\theta}{2} \;\Rightarrow\; \frac{\hat{\theta}}{2} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (2.168)$$

suffers indeed from the fact that not all values of the joint statistics
$\{\Phi(x_1),\dots,\Phi(x_N)\}$ are feasible, as some might produce an estimate $\hat{\theta} <
\max\{x_1,\dots,x_m\}$ with the consequence of minimizing rather than maximizing
the x likelihood (as a counterpart of the lack of the well behaving property
remarked for the statistic $\sum_{i=1}^{m} x_i$ (see Example 2.1)).
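The failure of the naive plug in estimate is easy to reproduce. In the sketch below the sample is hypothetical and deliberately contains one large value, so that twice the sample mean falls below the sample maximum and the likelihood (2.167) vanishes there.

```python
def uniform_likelihood(sample, theta):
    """Likelihood (2.167): theta^(-m) if every observation lies in [0, theta]."""
    if theta <= 0 or min(sample) < 0 or max(sample) > theta:
        return 0.0
    return theta ** (-len(sample))

sample = [0.9, 0.1, 0.2, 0.1, 0.2]          # hypothetical observations
theta_mle = max(sample)                      # maximum likelihood estimate
theta_plug_in = 2 * sum(sample) / len(sample)   # naive plug in estimate (2.168)

print(theta_mle, uniform_likelihood(sample, theta_mle))
print(theta_plug_in, uniform_likelihood(sample, theta_plug_in))
```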


Example 2.42. Consider a Poisson variable X of parameter $\mu$. Since this
parameter represents both the mean and the variance of X we have two plug in
estimators, namely

$$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i; \qquad \hat{\mu}' = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \frac{1}{m}\sum_{j=1}^{m} x_j\right)^2$$

But only one, the former, maximizes the sample likelihood.


Plug frequencies 
into your 
estimators 


but be cautious, 


still here. 




Methodological
shortcuts

Remark 2.9. Note that in the most suitable cases the main ingredient of the
MLE is a sufficient statistic. Hence this method uses in a quick way the statistics
which in our approach are generally at the basis of the parameter distribution
law. In this perspective maximum likelihood is a counterpart of the twisting
argument, just as the method of moments is a counterpart of the population
bootstrap.


2.3.2.4 Misunderstanding 


The more we
observe the less
we understand:

A disappointing condition occurs when in studying a discipline the more you
study it the less you understand: either the discipline (at least as explained) is
wrong or you missed its basics. Your mind may be temporarily out of synch
with the concepts underlying the discipline. It could be a suitable way for
better stressing the latter, as we will see when speaking of Boltzmann machines
in a next chapter. But you must ensure that in the long term the probability
of considering false what is true and vice versa proves extremely low. This is
simply stated, when we are looking for a given property $\theta$, through the following
consistency property in the Kolmogorov approach.


A suitable
convergence

Definition 2.14. For a random variable with parameter $\theta$ an estimator $T_m$ of the
latter, obtained from a sample of size m, is consistent if $T_m$ tends in probability
to $\theta$. According to Sec. 1.2.1 this is denoted as $T_m \xrightarrow{P} \theta$, and means that:

$$\lim_{m\to+\infty} P(|T_m - \theta| > \varepsilon) = 0 \quad \forall \varepsilon > 0 \qquad (2.169)$$

Formally, this means that:

$$\forall \varepsilon > 0, \forall \delta > 0 \;\; \exists m^*: \forall n > m^* \;\; P(|T_n - \theta| > \varepsilon) < \delta \qquad (2.170)$$

guaranteed by
unbiasedness

This property always holds for the weakly unbiased estimator
$\frac{1}{m}\sum_{i=1}^{m} g(x_i)$, provided that Y = g(X) has a finite variance, as we can see
through the following statements.


Fact 2.5. For any weakly unbiased estimator $\frac{1}{m}\sum_{i=1}^{m} Y_i$ of $\theta$ ³⁸

$$\mathrm{MSE}\left[\frac{1}{m}\sum_{i=1}^{m} Y_i\right] = V\left[\frac{1}{m}\sum_{i=1}^{m} Y_i\right] = \frac{1}{m}V[Y] \qquad (2.171)$$

This fact trivially derives from the unbiasedness definition and the application
of (B.96) ³⁹. A first operational consequence is the following:


38 Its simplest version, concerning $\bar{X}$, hence with $E[\bar{X}] = \mu_X$, $V[\bar{X}] = V[X]/m$,
is at the basis of most results of conventional statistical theory.

39 Note that, unlike (2.102), here MSE takes the estimator as random variable and the
parameter as fixed reference.




Fact 2.6. Whenever $\theta = E[g(X)]$ a consistent estimator of it is $\frac{1}{m}\sum_{i=1}^{m} g(x_i)$.

Note, however, that we might have biased estimators for some g, such that
$\theta \neq E[\frac{1}{m}\sum_{i=1}^{m} g(X_i)]$, still having an MSE going to 0 with the sample size.
A typical example is the statistic $\sum_{i=1}^{m}(x_i - \bar{x})^2/m$ which, in the case of
X Gaussian, has an MSE w.r.t. $\sigma^2$ going faster to 0 than the MSE rooted
on the weakly unbiased estimator $\sum_{i=1}^{m}(x_i - \bar{x})^2/(m-1)$ (see Example 2.34).
Vice versa, we may have estimators converging in probability to the parameter
without having a similar convergence of their MSE to 0. An example that is
as trivial as it is artificial is represented by the statistic $s = 2^m$ if $\prod_{i=1}^{m} x_i = 1$
and $\sum_{i=1}^{m} x_i$ elsewhere. If it is taken from a Bernoulli sample the statistic is
such that: i) it converges in probability to mp, ii) it has E[S] converging to mp,
and iii) it has MSE diverging to $+\infty$.

In this regard, we may state a set of sentences allowing us to appreciate the
convergence rate of the sample frequency $\Phi$ to p in the Bernoulli case and more
generally of $\frac{1}{m}\sum_{i=1}^{m} g(X_i)$ to E[g(X)]. We appreciate this convergence in
terms of a fraction of possible samples with a given size that realizes statistics
far from the goal parameter more than a given $\varepsilon$. This fraction is given by the
probability of extracting this kind of sample.


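Lemma 2.7 (Markov inequality). For any function q such that $q(x) \geq 0$ $\forall x \in D_X$

$$P(q(X) \geq \varepsilon) \leq \frac{E[q(X)]}{\varepsilon} \quad \forall \varepsilon > 0 \qquad (2.172)$$

Proof. Simply because

$$E[q(X)] = \sum_{x \in D_X} q(x)P_X(x) \geq \sum_{x \in D_X \text{ s.t. } q(x) \geq \varepsilon} q(x)P_X(x) \geq \varepsilon \sum_{x \in D_X \text{ s.t. } q(x) \geq \varepsilon} P_X(x) = \varepsilon P(q(X) \geq \varepsilon) \qquad (2.173)$$

The proof is analogous in case of X continuous. □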
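Lemma 2.8 (Chebyshev inequality).

$$P(|Y - E[Y]| \geq \varepsilon) \leq \frac{V[Y]}{\varepsilon^2} \quad \forall \varepsilon > 0 \qquad (2.174)$$

Proof. Simply from the Markov inequality with $q(y) = (y - E[Y])^2$. Indeed

$$P(|Y - E[Y]| \geq \varepsilon) = P((Y - E[Y])^2 \geq \varepsilon^2) \leq \frac{E[(Y - E[Y])^2]}{\varepsilon^2} \qquad (2.175)$$

□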
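To see how loose the Chebyshev bound can be, one may compare it with an exact tail probability on a finite distribution. The sketch below does so for a fair die, a hypothetical example chosen for convenience, using exact rational arithmetic.

```python
from fractions import Fraction

def chebyshev_check(pmf, eps):
    """Return (exact tail probability, Chebyshev bound) for a finite pmf."""
    mean = sum(y * p for y, p in pmf.items())
    var = sum((y - mean) ** 2 * p for y, p in pmf.items())
    tail = sum(p for y, p in pmf.items() if abs(y - mean) >= eps)
    return tail, var / eps ** 2

die = {face: Fraction(1, 6) for face in range(1, 7)}   # hypothetical example
tail, bound = chebyshev_check(die, eps=2)
print(tail, bound)
```

The bound holds but is far from tight, which is why the Gaussian approximation is usually preferred whenever it is legitimate.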


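Theorem 2.3. If both E[g(X)] and V[g(X)] are finite, then $\forall \varepsilon, \delta > 0$ and
$m > \frac{V[g(X)]}{\varepsilon^2 \delta}$

$$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} g(X_i) - E[g(X)]\right| > \varepsilon\right) < \delta \qquad (2.176)$$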


and by sample 
mean in general. 


Consistency, 
unbiasedness and 
MSE may diverge 


What about the 
convergence 
rate? 


A technical 
inequality 


its meaningful 
interpretation. 


A strong 
consistency 
definition 


With large sizes, 
sample and 
population 
commute 


The more we 
observe the 
closer we get to 
the parameter: 
consistent 
estimation 




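Proof. Simply by inverting relation (2.174) with $Y = \frac{1}{m}\sum_{i=1}^{m} g(X_i)$, thus
having V[Y] = V[g(X)]/m. □

Corollary 2.1.

$$P(|Y - \theta| \geq \varepsilon) \leq \frac{\mathrm{MSE}(Y,\theta)}{\varepsilon^2} \quad \forall \varepsilon > 0 \qquad (2.177)$$

Proof. Simply because also $\mathrm{MSE}(Y,\theta)$ is the expected value of a nonnegative
function q as in Lemma 2.7. □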


Definition 2.15. For a random variable with parameter $\theta$, an estimator $T_m$ of
the latter, obtained from a sample of size m, is consistent in mean square if

$$\lim_{m\to+\infty} \mathrm{MSE}(T_m,\theta) = 0 \qquad (2.178)$$


Corollary 2.2. An estimator T is consistent if it is consistent in mean square.


As we are considering a sample that tends to identify with the population,
the above sentences read almost identically in our framework. Namely, we deal
with the entire sequence of observations we will face in the future. It is
characterized by a parameter $\theta$ (non random because we refer exactly to this
sequence). Hence $\Theta$ collapses to $\theta$. You may think of a set of parallel half lines
representing the possible X stories. As the sample observation proceeds
you squeeze the lines, starting from a common beginning. The more a segment
is squeezed the tighter is the spanning allowed by its suffix. Until, ideally, you
reach the infinitely far end; at this point you face exactly the whole sequence,
having no longer a random parameter $\Theta$ but just a specification $\theta$ of it.

Within the sequence we look at a prefix of a smaller, yet extremely large size
m and we check if a $t_m$ estimator is consistent, i.e. if $t_m \xrightarrow{P} \Theta$ since

$$\lim_{m\to+\infty} P(|t_m - \Theta| > \varepsilon) = 0 \quad \forall \varepsilon > 0 \qquad (2.179)$$

You have two problems: i) to identify a statistic t asymptotically converging
exactly to $\theta$, and ii) to appreciate the convergence rate, with the further
complication that you will know $\theta$ just when you reach the infinitely far end of
the story. Vice versa you may start exactly from the end and, as discussed in
Sec. 1.2.2, consider the actual initial part of it as a random sample drawn from
a population of parameter exactly equal to $\theta$. We followed this strategy in this
section with the understanding that the probabilistic results obtained in such
a way directly translate into asymptotic statements in our original framework.
In particular, $\mathrm{MSE}(T_m,\theta)$ and $\mathrm{MSE}(t_m,\Theta)$ tend to the same limit for $m \to +\infty$.
Hence if $T_m = h(X_1,\dots,X_m)$ is consistent in mean square, then the limit of
each specification $t_m = h(x_1,\dots,x_m)$ satisfies the condition:

$$\lim_{m\to+\infty} P(|t_m - \Theta| \geq \varepsilon) \leq \lim_{m\to+\infty} \frac{\mathrm{MSE}(t_m,\Theta)}{\varepsilon^2} \qquad (2.180)$$

After Definition 2.15 and Corollary 2.1 now applied to the random variable
$\Theta$, we may also state:



Corollary 2.3. A statistic t is a consistent estimator if it is a specification of a
T consistent in mean square.


Corollary 2.4. $\bar{X}$ converges in mean square to E[X].


As a corollary of the results about consistency, here below we enunciate two
theorems that hold in both frameworks.


Theorem 2.4 (Large numbers law). For a Bernoulli variable X of parameter
p, denoting $\phi_m$ the sample frequency $\frac{1}{m}\sum_{i=1}^{m} x_i$, we have that for each $\varepsilon$, $\delta$
and $m > 1/(4\varepsilon^2\delta)$

$$P(|\Phi_m - p| > \varepsilon) < \delta \qquad (2.181)$$
$$P(|\phi_m - P| > \varepsilon) < \delta \qquad (2.182)$$

Proof. It comes from the fact that:

1. since V[X] = p(1 - p) then $V[\Phi_m] \leq 0.25/m$,

2. since P follows approximately a Beta distribution with parameters
$(m\phi_m + 1, m - m\phi_m)$, then $V[P] \leq 0.25/m$ as well.

□
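The bound of the theorem can be put to work directly. The following sketch computes the smallest sample size allowed by $m > 1/(4\varepsilon^2\delta)$ and then estimates by simulation how often the sample frequency drifts from p by more than $\varepsilon$. The values of $\varepsilon$, $\delta$, p and the number of trials are hypothetical choices of ours, and the simulation covers only the first inequality, (2.181).

```python
import math
import random
from fractions import Fraction

def lln_sample_size(eps, delta):
    """Smallest integer m with m > 1 / (4 eps^2 delta), as in Theorem 2.4."""
    bound = 1 / (4 * Fraction(eps) ** 2 * Fraction(delta))
    return math.floor(bound) + 1

def deviation_rate(p, m, eps, trials, rng):
    """Fraction of simulated samples whose frequency drifts from p by more than eps."""
    bad = 0
    for _ in range(trials):
        freq = sum(rng.random() < p for _ in range(m)) / m
        bad += abs(freq - p) > eps
    return bad / trials

eps, delta = Fraction(1, 10), Fraction(1, 10)   # hypothetical requirements
m = lln_sample_size(eps, delta)
rate = deviation_rate(p=0.5, m=m, eps=float(eps), trials=500,
                      rng=random.Random(0))
print(m, rate)
```

The observed rate is typically much smaller than $\delta$, since the bound is distribution free and therefore conservative.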


The large numbers law supplies a tool for quantitatively controlling the as- 
sertion with which we initiated our discussion to give probability an operational 
meaning, namely: “frequencies converge to probabilities”. Thus we close the 
circle of our reasoning by generating within the theory itself a tool for control- 
ling how the theoretical models match the observations. The law refers in its 
general statement to a generic sample mean of a variable X (not necessarily of 
a Bernoulli variable) binding in a similar way with e and ô its drift from the 
expected value for an m deriving from Chebyshev inequality. Namely 


Corollary 2.5. For a random variable X of mean $\mu$ and variance $\sigma^2$, and a
sample of X of size m and sample mean $\bar{x}_m$ we have that for each $\varepsilon$, $\delta$ and
$m > \sigma^2/(\varepsilon^2\delta)$

$$P(|\mu - \bar{X}_m| > \varepsilon) < \delta \qquad (2.183)$$

The analogous lower bound on m to get $P(|M - \bar{x}| > \varepsilon) < \delta$ is not so
general, depending on the M distribution law which does not benefit from Fact
2.5. Convergence like (2.181) is referred to as a weak form of the large numbers
law. In its stronger form this law limits to a greater extent the occurrence of
drifts from the expected values.


2.3.3 Adequacy of the sample size 


A common operational use of the parameters distribution law (rather of the con- 
fidence interval in the Kolmogorov approach) is to determine the size of a sample 
sufficient to guarantee the gauging of the unknown parameter in a satisfactorily 


The observations 
- modeling 
welding point 


Still more in 
general 


The sample size 
problem: how 
long must I 
observe a 
phenomenon to 
be familiar with 
it? 


A quick solution 


not widely 
exportable, 


except if you 
have suitable 
approximations, 




tight interval with an assigned confidence. Thus a corresponding problem in our
leading example is: “how many messages must I screen to gauge the probability
of a junk message arrival with an accuracy ⁴⁰ of 10% with a confidence 0.900?”
This problem has an almost immediate solution in the Kolmogorov approach in
the cases where a solution exists.


Example 2.43. Assume the file you are considering to be a sequence of
specifications of independent identically distributed Gaussian variables X of
parameters $\mu$ and $\sigma$. Having $\sigma$ for given, how large must the sample be in order
to have an estimate $\hat{\mu} = \bar{X}$ shifting from $\mu$ no more than $\varepsilon$ with probability
$\geq 1 - \delta$, i.e. such that $P(|\mu - \bar{X}| \leq \varepsilon) \geq 1 - \delta$? Since $Z = (\mu - \bar{X})/(\sigma/\sqrt{m})$
follows a Normal distribution law, we simply obtain m as the solution of equation

$$z_{1-\delta/2} = \frac{\varepsilon}{\sigma/\sqrt{m}} \qquad (2.184)$$

getting

$$m = \left(z_{1-\delta/2}\,\frac{\sigma}{\varepsilon}\right)^2 \qquad (2.185)$$

For instance, when $\varepsilon = \delta = 0.15$ and $\sigma^2 = 0.1$ we get 10 as minimal value for
m.
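The computation in (2.185) takes one quantile and one rounding. A minimal sketch follows, using the standard library `statistics.NormalDist` for the quantile; the function name is ours, and the figures are those quoted in the example.

```python
import math
from statistics import NormalDist

def gaussian_sample_size(eps, delta, sigma):
    """Smallest integer m satisfying (2.185) for known sigma."""
    z = NormalDist().inv_cdf(1 - delta / 2)    # quantile z_{1 - delta/2}
    return math.ceil((z * sigma / eps) ** 2)

m = gaussian_sample_size(eps=0.15, delta=0.15, sigma=math.sqrt(0.1))
print(m)
```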


We got a solution since we relied on the pivotal quantity $Z = \frac{\mu - \bar{X}}{\sigma/\sqrt{m}}$. This
solution is particularly easy to find because you compute the quantile once and
then solve a simple algebraic equation. The same facility occurs with a few
families of random variables. With a Gamma variable X for instance we have that
the sum $\sum_{i=1}^{m} X_i$, scaled through $\lambda$, follows a Chi square distribution with a
parameter independent of $\lambda$ modulo a scale factor, as discussed in Example 2.16.
Otherwise, to maintain this ease we often try to approximate the statistic's
distribution law with a Gaussian one.


Example 2.44. Assume the file you are considering to be a sequence of
specifications of independent identically distributed Bernoulli variables X of
parameter p. How large must the sample be in order to have an estimate
$\hat{p} = \bar{X}$ shifting from p no more than $\varepsilon$ with probability $\geq 1 - \delta$, i.e. such
that $P(|p - \bar{X}| \leq \varepsilon) \geq 1 - \delta$? In the approximation of $\frac{\bar{X} - p}{\sqrt{p(1-p)/m}}$ with a Normal
variable Z, in force of Theorem 2.3, and majorizing p(1 - p) with 0.25, the
problem has the solution $m \geq \left(\frac{z_{1-\delta/2}}{2\varepsilon}\right)^2$.


If the approximation is unattainable, possibly because the sample size is too 
small, we may always rely on the Chebyshev inequality (2.174). 


40 where the accuracy of the interval $(x_1, x_2)$ is defined as the value $x_2 - x_1$.




Example 2.45. Using Corollary 2.1 and the above majorization on p(1 - p), we
get the broader solution of the problem in Example 2.44: $m \geq \frac{1}{4\varepsilon^2\delta}$.
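The gap between the two solutions is worth seeing in numbers. The sketch below evaluates both bounds for the junk mail question, under the assumption (ours) that an accuracy of 10% means an interval of width 0.10, hence $\varepsilon = 0.05$, with $\delta = 0.1$. Function names are hypothetical.

```python
import math
from statistics import NormalDist

def bernoulli_size_gaussian(eps, delta):
    """Sample size from the Gaussian approximation of Example 2.44."""
    z = NormalDist().inv_cdf(1 - delta / 2)
    return math.ceil((z / (2 * eps)) ** 2)

def bernoulli_size_chebyshev(eps, delta):
    """Sample size from the Chebyshev based bound of Example 2.45."""
    return math.ceil(1 / (4 * eps ** 2 * delta))

eps, delta = 0.05, 0.1      # assumed reading of "accuracy 10%, confidence 0.900"
m_gauss = bernoulli_size_gaussian(eps, delta)
m_cheb = bernoulli_size_chebyshev(eps, delta)
print(m_gauss, m_cheb)
```

The Chebyshev figure is several times larger, which is the price paid for not assuming anything about the shape of the distribution of the sample mean.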


In our approach we exploit pivotal quantities at their root, “inverting” their 
distribution laws w.r.t. m. The situation is actually a bit more complicated, as 
you will see from the following examples. 


Example 2.46. Assume the file you are considering to be a sequence of
specifications of independent identically distributed Bernoulli variables of
parameter P. Wanting to gauge P within the interval (0, 0.08) with a confidence
0.900, how many bits of the file must I examine? The answer to this question comes
straight by solving equations like (2.64) with fixed $p_s = 0.08$ and $\delta = 0.1$, k (the
number of observed 1's) as parameter and m as variable. Relying on the left
inequality in (2.20) for approximating $F_P$, the curves in Fig. 2.28 connect m
with k. From these curves you see that the number m of the bits to be observed
grows steeply with k. You may appreciate more properly this inconvenience
looking at the part (b) of the figure. Here the abscissa is the sample mean
$\bar{x} = k/m$, thus indicating that, starting from a given $\bar{x}$, if this value does not
change too much, as expectable, in the further extractions, the sample size $m_{\bar{x}}$
required by your inference is the intercept of the curve by a vertical through
$\bar{x}$. Rather, you see a total of $m_{\bar{x}}$ bits and estimate an upper bound $p_s = 0.08$
if the sample mean remains almost unchanged, otherwise you must update the
right extreme of the confidence interval with the methods studied in Sec. 2.2.1.
Moreover, with a bad starting, i.e. with an $\bar{x}$ close to 0.08 you must plan to
examine a long series of bits.

A sounder, since operationally easier, way of formulating the problem
is the following. Given a file as in the previous example, how large is the number
m of bits I must examine for gauging P in an interval of width 0.080 around
$\frac{k}{m}$ with confidence 0.900? This question is sounder as I know that P is
well distributed around $\frac{k}{m}$ and converges in probability exactly to this value as
$m \to +\infty$. Indeed from the solution of the equation

$$F_P\left(\frac{k}{m} + 0.040\right) - F_P\left(\frac{k}{m} - 0.040\right) = 0.900 \qquad (2.186)$$

approximating the former c.d.f. from below through the lower bound and the
latter from above through the upper bound in (2.20) we obtain the curve of Fig.
2.28(c) that exhibits a request for sample size that is lower than in the previous
case and looks feasible for any value of $\bar{x}$.

Moreover we may relieve the load of the involved Beta distribution c.d.f.
computation when m is large by approximating the variable $K/m$ with a Gaussian
variable with mean p and variance $\frac{p(1-p)}{m}$ as mentioned in Example 2.44. In this
case we find the solution of (2.186) from the following equation

$$z_{0.05} = \frac{-0.04}{\sqrt{\left(\frac{k}{m} + 0.04\right)\left(1 - \left(\frac{k}{m} + 0.04\right)\right)/m}} \qquad (2.187)$$

still
corresponding to
twisting quantiles


Fixing a priori a
confidence
interval is more
costly


than adapting it
to the observed
data,


for m large, still 
easier. 




Fig. 2.28: Course of the sample size m sufficient to ensure a confidence 0.900 for the
parameter P of a Bernoulli variable. (a) m vs. the number k of ones and (b) m vs. a
frequency $\bar{x} = k/m$ when the interval extremes are (0, 0.080). (c) same as (b) when
the interval has width still 0.080 but is centered around $\bar{x}$ (lower curve) contrasted
with the previous curve (for numerical reasons lower curve starts from $\bar{x} = 0.030$).


Figure 2.29 shows the course of the solution found for m as function of k, gauged
from below by the exact curve computed in Fig. 2.28(c) and from above by the
upper bound we find if we approximate $(\frac{k}{m} + 0.04)(1 - (\frac{k}{m} + 0.04))$ with its
maximum equal to 0.25, obtained for p = 0.5. The small advance of the contact
point of the two curves before p = 0.5 is due to the approximation of implicitly
considering the variance of $\frac{K}{m}$ equal to $(\frac{k}{m} + 0.04)(1 - (\frac{k}{m} + 0.04))$ in place of
$(\frac{k}{m} - 0.04)(1 - (\frac{k}{m} - 0.04))$ also in correspondence of the P lower bound. This
has been done in order to better exploit the symmetry of the Gaussian variable,
having that $F_Z(z) = 1 - F_Z(-z)$ hence $F_Z(z_{1-\delta/2}) - F_Z(z_{\delta/2}) = 1 - 2F_Z(z_{\delta/2}) =
1 - \delta$, hence equation (2.187).
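Solving (2.187) for m gives $m = (z_{0.05}/0.04)^2\,p(1-p)$ with $p = k/m + 0.04$. The sketch below evaluates this expression as a function of the observed frequency, which is treated here as a given number (an assumption of ours, since in principle it changes with m), together with the upper bound obtained by replacing $p(1-p)$ with 0.25.

```python
import math
from statistics import NormalDist

def size_from_frequency(freq, half_width=0.04, delta=0.1):
    """Sample size solving (2.187) for an observed frequency freq = k/m."""
    z = NormalDist().inv_cdf(delta / 2)         # z_{0.05}, a negative quantile
    p = freq + half_width                       # variance taken at the upper extreme
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

def size_upper_bound(half_width=0.04, delta=0.1):
    """Same computation with p (1 - p) majorized by 0.25."""
    z = NormalDist().inv_cdf(delta / 2)
    return math.ceil((z / half_width) ** 2 * 0.25)

for freq in (0.1, 0.3, 0.46):                   # hypothetical frequencies
    print(freq, size_from_frequency(freq))
print(size_upper_bound())
```

The two values meet at a frequency of 0.46 rather than 0.5, which matches the small advance of the contact point discussed above.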


Focusing on k rather than on a normalized issue of it (like $\bar{x}$) becomes
misleading in the initial question about your mailbox that indeed results ill posed.
Indeed:




Fig. 2.29: Gaussian approximations of lower curve in Fig. 2.28 for large m. The lowest
curve is the original one solving (2.186); the intermediate one comes from solution
of (2.187); the horizontal line corresponds to majorizing the variance of $\frac{K}{m}$ with its
maximum = 0.25.


Fig. 2.30: Decreasing of the confidence $1 - \delta$ of the interval of width 10% around $\bar{x}$
with increasing sample size m.


Example 2.47. If we consider the course of $F_P(\frac{k}{m} + 0.054) - F_P(\frac{k}{m} - 0.054)$
with m, as done in Fig. 2.30 for k = 10 we see that the confidence of the interval
$(\frac{k}{m} - 0.054, \frac{k}{m} + 0.054)$ decreases with m. This is not astonishing if we realize
that for fixed k any shift of m from it has the twofold effect of decreasing the size
of the interval on one side and increasing the variance of $p_s$ (the right extreme
of the confidence interval) as we are moving away from the 0 variance prefix
where $\frac{k}{m} = 1$. From an operational perspective this tells us that we must rely
on the normalized statistic $\frac{k}{m}$ for sizing our confidence interval, as done in the
previous example.


The same procedure may be applied to the parameters of other distribution 
laws, for instance: 


Example 2.48. Wishing to estimate the mean time between failures $\frac{1}{\lambda}$ of a kind
of electric engine, we know that the value of this parameter within commercial
technology is of the order of 10° hours. Thus with the same notations as in Example
2.16, we are interested in an upper bound on the parameter $\frac{1}{\lambda}$ that moves


No way without a 
suitable
normalization of 
the statistic 


A proper 
normalization is 
the sample size 
itself 


Same method in 
principle for every 


distribution law 




Fig. 2.31: Course of the sample size m sufficient to ensure a confidence 0.900 to the
interval $(0, \bar{x} + 2 \cdot 10^4)$ for the parameter $1/\lambda$ of a negative exponential distribution
law.


from $\bar{x}$ no farther than 10° hours. Like in the mentioned example we assume the
time between failures to follow a negative exponential distribution, thus we can
exploit the relation

$$\frac{1}{\Lambda} = \frac{2s}{\chi^2_{2m}} \qquad (2.188)$$

where we explicitly mention the degrees of freedom of the $\chi^2$ variable in the
subscript of $\chi$, for drawing the curve $m_{\bar{x}}$ in Fig. 2.31 for $\delta = 0.1$.


2.4 Bibliographical notes and further readings 


The leitmotif of this chapter is the study of the population parameters'
distribution law. It is always hard to convince the mathematically cultivated reader
that we are not dealing with an exotic variant of the Bayesian approach to
inference [Florens et al., 1990] and that we neither miss nor tacitly assume a prior
distribution for these parameters. We simply do not need this kind of prior. As
mentioned in Chapter 1, here we have a deterministic prior represented by the
computational context, which we couple with random seeds to build up sampling
mechanisms. We typically use unitary uniform seeds, as customary in
cryptography, or also other kinds of seeds, for instance standard Gaussian variables, as
the totally blind part of our sampled data. Starting from relevant properties of
these data we look for compatible populations to which to attribute them. This
is the key operation of our inference framework and it may prove heavy, for it
requires the solution of inverse problems either numerically or analytically in
the best of cases. Once this step is passed we have the parameters' distribution
law and we find estimators as solutions of plain direct problems.

As stated earlier, we are not claiming that ours is the best framework. It is 
quite suitable for computing confidence intervals, hence for learning, as we will
show in the next chapter, even in cases where conventional approaches have no
clean solution tools [Meeker and Escobar, 1995]. Conversely, tests of hypothesis
like in Example 2.32 represent the favorite operational field of classical statistical 
theory. Here the probabilistic framework, say the random variable distribution 
law, is completely defined and we study properties that are only seldom falsified 
by samples of this variable. This is the quintessential definition of statistical 
test given by Martin-Lof [Martin-Lof, 1966] for capturing the basic features of a 
testing algorithm outputting “true” / “false” having a random sample in input. 
Thus if you say “the probabilistic framework you hypothesize is false” (for in- 
stance, the mean of a given variable does not have the value you fixed in the 
framework), when the algorithm outputs “false” you state a wrong sentence only 
seldom as well. The optimality of the algorithm and the probability of being
wrong depend on the distribution law of the output, i.e. of a variable that is a
function of the drawn sample, hence of the prefixed probabilistic framework in
turn. The reader may find many precise and readable books on the matter;
see for instance [Wilks, 1962, Rohatgi, 1976, Ross, 1987]. A corollary of this
approach is a point estimator theory where we may consider prominent features, 
such as mean value or standard deviation, of the distribution law of the above 
output or rather of some statistics constituting intermediate outputs in this cal- 
culation. Depending on the values of these features the statistics are candidates 
for substituting the parameters characterizing the probabilistic framework once 
the latter have been hidden [Neyman and Pearson, 1933, Kendall and Stuart, 
1961]. A typical criticism of this approach comes from a friend of ours working
in epidemiological inspections. Having 20 deaths from cancer imputable
to polluting agents, he is obliged to throw out these deaths since they are not
enough to contrast with the favorable probabilistic framework (no pollutants
in operation) people have hypothesized in their defense [Micheli et al., 1999].
This is why, in accordance with the paradigmatic sentence "absence of evidence
is not evidence of absence" [Altman and Bland, 1995], over the last 20 years
leading medical journals have required the adoption of confidence intervals
complementing significance tests as a standard part of the presentation of the
quantitative results of studies [Altman, 2005]. In our approach indeed we do not waste any
data. We pivot exactly on the data we have available for formalizing in terms of 
pairs (properties on data, properties on the future population) the knowledge 
we have acquired about the observed phenomenon. The common randomness 
of the two properties is localized again in a Kolmogorov framework that now 
has the main function of acting as a default uniform random number generator. 
This is how we recover the rich bulk of theoretical results of probability theory, 
now utilized for giving an operational counterpart to the logical implications 
connecting the above property pairs. Within this framework we overcome some
puzzles the mentioned books face in dealing with confidence intervals and
give a rationale to some attempts at considering the randomness of the framework
parameters as formulated in [Zubrzycki, 1972, De Finetti, 1975] or in the
statistical approach to the confidence interval enunciated in [Mood et al., 1974].

A well formulated theory of confidence intervals is at the basis of modern
learning theory [Valiant, 1984], which will constitute the subject of the second
part of this book. Results on point estimators represent a corollary of the theory,
where we find features other than quantiles of the parameters' distribution
laws. A key role is played in both interval and point estimators by the sufficient
statistics, possibly in their weak form of well behaving statistics. They are well
studied statistics also in the classical approach, due to the elegant property of
substituting the unknown parameter in the sample conditional probability. In
our framework they have the irreplaceable operational role of registering the
sensitivity of the sample data to a change in the parameter of the distribution
law describing their specifications, and they have this capability just because
they are logically well founded computer programs. This is the distinguishing sieve
of our approach for understanding data; connections between computational
and probabilistic aspects of distribution laws are however widely discussed in
[Li and Vitanyi, 1997]. Once the logical aspects have been assessed, computing
a confidence interval is just a matter of finding the distribution law of the
selected statistic. We may succeed either analytically (the quick option) or
numerically, through a revised and definitely more powerful version of the bootstrap
methods introduced by Efron [Efron, 1982, Efron and Tibshirani, 1993].

Our point estimators are in general structurally different from the classical
ones, since the underlying distribution laws are different. We may come back
to the latter, which are more easily checkable estimators in general, either for
large sample sizes, where the law of large numbers states the liaison between the
approaches, or when we look for preliminary, possibly broadly approximate results.
An in-depth discussion of the law of large numbers and of central limit theorems, as
the welding point between data and models in both approaches (for instance with
a distinction between weak and strong enunciation of the statements), can be
found in classic yet basic books such as [Feller, 1960, Cramér, 1958].


Part II 


Machine learning applications 




3 Computational learning


In the previous chapters we learned how to quantify the disappointment about our
mailbox due to junk messages. We may get an accurate appreciation of the
problem, possibly in relation to boundary conditions such as the hour of day,
the day of week or the week of year. But our true target is the removal of this
inconvenience, perhaps through an automatic recognizer and eraser of unsolicited
messages. In Sec. 1.3.1 we sketched a similar recognizer for spam messages
based on a scoring system and a thresholding decider described in Table 1.6.
The matter is how to assign the scores and set the threshold, which amounts
to drawing a domain c in the feature space containing all representative points
of spam messages. Having this domain, we can locate every future message
either inside c and discard it, or outside and keep it for further reading.
Computing an approximation h of c on the basis of the mailbox's past history is
again an inference action, which we call learning because of two peculiarities:


• richer observations, since we simultaneously record both the features of the
message and its attribute of being a spam message or not;


• functionally dependent observations. With reference to Fig. 1.12 we interpreted
the stochastic connection between two variables that count the
random points falling in the two small domains A and B, respectively, in
terms of the intersection between them. Here we consider another dependency
by focusing on a single domain, say c, and on the function
that exactly discriminates the membership to it.


It is precisely this function, explaining the belonging to c of a point in terms
of its coordinates, rather than the one explaining the coordinates themselves,
that is the object of our inference. Of course, we do not expect to know it exactly,
since we have seen just a prefix of a string of data that does not yet exist, and
this function reveals itself simply by labeling past messages as junk or not junk.
Thus we try to manage the uncertainty of future data by considering the distribution
law of some relevant parameters of c, typically the probability measure of the
symmetric difference between it and h, where h constitutes a statistic we compute to
approximate c. In other words, we study the distribution law of the probability
that a future message is qualified as junk by h when on the contrary it is of
interest to us, and vice versa.




But who computes h? Till now the most complex statistic we have computed 
was either a linear combination of some rescaling of the observations or their 
maximum (or minimum) as coded by lemmas 2.2 and 2.3. Here is another 
matter. We need to state a relation between some measurable characteristics 
of a message and its boredom degree, and we cannot do it without a good 
suggestion, what Artificial Intelligence scientists call an inductive bias [Mitchell, 
1997]. For instance, in order to discover the distribution law of the message length
the bias is constituted by the law of large numbers, telling us that its shape is well
approximated by a histogram of a large set of observed lengths. But guessing the
relation between length, font type and the other parameters in Table 1.6 with
the message boredom is just like gazing at a crystal ball if no insight comes to
us. We may imagine that the bias comes from a teacher who knows the problem.
Rather we may figure that, in turn, the teacher organized his experience on the 
basis of external factors such as the neural circuits of his brain or the software 
library of his computer processing the data. We will consider both the bias 
sources, speaking of symbolic learning in the latter case (in the rest of this 
chapter) — when the teacher means the class of functions where to find the most 
appropriate one for explaining the observed labels — and of subsymbolic learning 
(in the last chapter) — a typical attitude of our brain that comes into play when 
we speak of intuition, creativity and so on. The distinction between the two 
learning modes is not so sharp; indeed Chapter 5 will deal with an intermediate 
way of learning as well. 

As we want to work with true statistics we must consider two further properties
of h: its computability and its understandability. Roughly speaking, finding
a reasonable explanation for even a small set of observed cause-effect pairs may
prove a very hard task even if the effect is a Boolean variable like the membership
to c. It is a matter of adapting a function to fit these labels, a problem
that in its generality belongs to the very difficult class of NP-hard problems ¹.
In any case, when we design a learning procedure we must pay attention to the
amount of computing resources necessary for its implementation. Furthermore,
we want to understand what we learnt, or at least we want to manage the
inferred h easily. This means that if h needs a dossier of some hundred pages for
its description we will probably never use it in the future and, in any case, it
will not help us to understand the underlying cause-effect phenomenon. On the
contrary, all useful formulas boiling up in our mind take at most ten symbols
to be enunciated [Piaget, 1969].

We are often willing to sacrifice the accuracy of h in explaining the labels 
in favor either of bearable space and running time for computing it or of actual 
usability of the discovered function for understanding and predicting the future. 
Thus, jointly with the quality of the bias, two more veins will traverse the learning
bulk in the last three chapters of this book:


• the computational complexity of the learning task, which we negotiate with
the level of detail of c that constitutes a function to be learnt, in a framework
called structural risk minimization [Vapnik, 1995], and


• the simplicity of the learnt h, whose attainment is so important that we
may even renounce explaining some observations (rather admitting that
their cause-effect coupling is not so sharp) in a framework of fuzzy set
theory.


1See Appendix D for a short introduction to the computational complexity hierarchy.


We will learn to face this lack of accuracy that we either introduce intention- 
ally or, dually, bear because of the poorness of the observed examples, sometimes 
wrong, sometimes not well observed. 


3.1 Learning Boolean functions 


We operationally summarize the above discussion by referring to a learning algorithm
as a procedure for generating an indexed family of functions $h_m$ within a class,
with probability $P_{\mathrm{error}_m}$ of producing wrong outputs converging to 0 in
probability. The convergence goes with increasing m, the size of the available set
of examples of how the target function should compute. In this chapter we focus on
Boolean functions, i.e. we restrict their output to be either 0 or 1. For instance,
in Fig. 3.1 the domain of the functions is the Cartesian plane; the learning task
is to identify one particular circle c within the class C of all possible circles in
the plane. This might be a mathematical model for identifying the site and the
emission range of a source of radiating pollution, such as noise, X-rays, etc. in
a flat isotropic region ². Our examples might be identified with a set of randomly
distributed monitoring stations. The i-th station is described fully by its
position $x_i$ in the plane, together with a {0,1}-valued label telling us whether
pollution is detected above a given threshold by the station. We are concerned
with the probability that Mr. John Smith is exposed to radiation, assuming
the inhabitants to have the same distribution as the set of monitoring stations.
Thus,


• we decide that the target function is a circle (our inductive bias);


• we solve the non trivial computational problem of drawing a suitable circle
on the basis of the examples, for instance a consistent circle, i.e. a circle
not contradicting the observations, thus containing all polluted monitoring
stations and excluding the others ³;


• we refer the accuracy of the hypothesis not directly to the portion of region
which is misclassified (the part subjected to pollution yet declared safe
by the authority on the basis of the above monitoring, and vice versa) but
to the probability that Mr. Smith lives in this region. Thus it is irrelevant
that the region where h classifies differently from c is large, in any metric,
if no Mr. Smith will stay there.


We skip the first two items for now and focus on the third in the next section. 
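
The third item can be illustrated with a small Monte Carlo sketch: the accuracy of a
circle h with respect to a circle c is the probability of their symmetric difference
under the law that generates the points, not its area. The two circles, the two
point generators (`uniform_square`, `near_origin`) and all numerical values below are
hypothetical choices of ours.

```python
import random

def in_circle(point, center, radius):
    """Label of a point: True if it lies inside the circle."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return dx * dx + dy * dy <= radius * radius

def error_probability(c, h, draw_point, n_draws=100000, seed=0):
    """Monte Carlo estimate of the probability of the symmetric difference of c and h."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_draws):
        point = draw_point(rng)
        if in_circle(point, *c) != in_circle(point, *h):
            mismatches += 1
    return mismatches / n_draws

def uniform_square(rng):
    """Inhabitants spread uniformly over the unit square."""
    return (rng.random(), rng.random())

def near_origin(rng):
    """Inhabitants concentrated in a small corner, far from both circles."""
    return (0.05 * rng.random(), 0.05 * rng.random())

c = ((0.5, 0.5), 0.3)  # target circle: (center, radius)
h = ((0.5, 0.5), 0.4)  # hypothesis circle
spread_error = error_probability(c, h, uniform_square)
corner_error = error_probability(c, h, near_origin)
```

With uniformly spread points the estimate approaches the area of the ring between the
two circles, while with points confined to the corner the same ring carries no
probability at all.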


2The reader may note that we are shifting to a much simpler learning problem; for the
hardest ones she/he must wait until the end of the book.

3Consistency is a key condition in learning. As we will see later, its satisfaction ensures 
meeting the consistency property in Definition 2.14. 


A consistent estimator of a Boolean function, rather than a parameter;

for instance of a circle in a plane.

We: decide the shape

identify a function with this shape

appreciate benefits and drawbacks of our selection.

Once again, it is a matter of a Bernoulli variable,

but more complex to be monitored

X = 1 if we hit a target

We succeed if: either we are skillful or the target is very large




Fig. 3.1: The learning framework. $\mathfrak{X}$: the set of points belonging to the Cartesian
plane; bullets: 1-labeled (positive) sampled points; diamonds: 0-labeled (negative)
sampled points. h: a circle describing the sample; $c_i$: possible circles describing the
population. Line filled region: symmetric difference between h and $c_0$.


3.1.1 Error measure distribution law 


When $P_{\mathrm{error}_m}$ coincides with the probability that a future point falls in the
symmetric difference between c and h, discovering its distribution law involves
a relation similar to the one between sample and population properties of a
Bernoulli variable discussed earlier. Of the two inference methods introduced in
Chapter 2, bootstrapping of populations and twisting argument, only the
latter will be used in this chapter, since the information we can rely on about the
sampling mechanism is too poor to enable a bootstrapping procedure. With
reference to the logical implication (2.14), here we need more than one sampled
point to recognize that the probability measure of the error domain is less than
a given $\varepsilon$.

Let us proceed step by step and consider a generic sequence of random
variables Y's in a space $\mathfrak{Y}$ (explained by a proper sampling mechanism); for
any fixed subset $c \subseteq \mathfrak{Y}$ (call it concept), it is possible to refer to the random
variable

$$X' = \begin{cases} 1 & \text{if } Y \in c \\ 0 & \text{elsewhere} \end{cases} \qquad (3.1)$$

This is another way of explaining a Bernoulli variable whose behavior is identical
to that of X introduced through (2.1), provided that $P(U < p) = P(Y \in c)$.
The situation is different if c too is unknown. In this case we have two ways
of moving p to $\tilde p$ in (2.14) starting from an assigned shape h (call it hypothesis) to
c. We can both modify the probability of falling into h and enlarge h. To extend
the right implication in (2.14) we focus on the second way. Thus, for whatever
fixed parameterization of the Y distribution law, we obtain a picture similar to the one in
Fig. 2.4 for X′, considering an increasing sequence $B(h) = B_1 \subset B_2 \subset B_3 \subset \dots$



Fig. 3.2: A nested sequence of domains (dashed circles) containing a given domain h
(solid circle) in $\mathbb{R}^2$. For any sequence of random points, moving from $B_i$ to $B_{i+1}$ both
numbers of included points increase: those belonging to the sample and those belonging
to the population.


of subsets of $\mathfrak{Y}$ pivoted around h in the sense that h belongs to B(h). For the
mere sake of simplicity, in Fig. 3.2 we assumed $B_i$ to have the same circular
shape as h, since this shape is immaterial. Moving from one element of B(h) to
another corresponds to raising or lowering the threshold line in Fig. 2.4, with
an identical implication chain in regard to the number of 0's and 1's in sample
and population (namely, if the number of 1's observed in the sample increases,
then the number of analogous elements in the population will increase, too).
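
The implication chain along a nested sequence can be checked numerically. The sketch
below counts how many points of a sample, and of a larger population containing it,
fall inside concentric circles of increasing radius; uniform points in the unit
square, the common center and the radii are assumptions made only for illustration.

```python
import random

def counts_in_nested_circles(points, center, radii):
    """Number of points inside each circle of a nested (increasing-radius) sequence."""
    cx, cy = center
    return [sum((x - cx) ** 2 + (y - cy) ** 2 <= r * r for x, y in points)
            for r in sorted(radii)]

rng = random.Random(0)
sample = [(rng.random(), rng.random()) for _ in range(30)]
# The population continues the same sequence of random points
population = sample + [(rng.random(), rng.random()) for _ in range(3000)]
radii = [0.1, 0.2, 0.3, 0.4]
sample_counts = counts_in_nested_circles(sample, (0.5, 0.5), radii)
population_counts = counts_in_nested_circles(population, (0.5, 0.5), radii)
```

Both lists of counts can only grow (or stay equal) when moving to a larger circle,
which is the monotonicity the twisting argument relies on.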

Now, let us refer h to the set of uniformly distributed variables in input
to the function explaining Y (for instance assume $\mathfrak{Y} = \mathbb{R}^2$ and a pair $(U_1, U_2)$
generating the coordinates $(g_{\vartheta_1}(U_1), g_{\vartheta_2}(U_2))$ of Y). In this framework we can
reconsider the circle sequence in the picture in Fig. 3.2 as mappings of the same
h with changes of parameters $\vartheta = (\vartheta_1, \vartheta_2)$ increasing $P(Y \in h)$. Thus, if we put
together the two ways of augmenting p, the condition $p < \tilde p$ implies the existence
of a domain of exactly measure $\tilde p$ including h. Indeed we may figure that with
fixed $\vartheta$ we enlarge h in $h'$ so that $P(Y \in h') \ge \tilde p$, then we possibly modify $\vartheta$
to reduce the probability to exactly $\tilde p$ ⁴ (in case of bias on the discreteness of
$g_\vartheta$ we could not meet exactly $\tilde p$; this means that we must look for a discrete
distribution of $\Theta$ as well).

In the next section we will face the problems of 


1. how to recognize that h is the pivot of a sequence having among its elements
a $B_i$ of measure $\tilde p$ including h, and


2. how to relate the necessary condition to a property of the sample. 


4You could also look for a reduced $h'$ with $P(Y \in h') = \tilde p$, but in this case you have no
guarantee that the number of points falling inside is no lower than those falling in the original
h.


Said in other words: circles in Fig. 3.2 trigger either twisting argument on p, on c, or both

We must recognize and/or exploit the triggers




Both questions find a solution in the fact that the hypothesis h we make on
c is a statistic, hence the output of a special algorithm having in input a sample
of the pair of variables Y and X′.


3.1.2 Learning rectangles 


Let us start with h belonging to the class of rectangles. We move from the
one-dimensional case of Fig. 2.4 to the two-dimensional case depicted in Fig.
3.3(a) ⁵. Here again we give label 1 to the single coordinate $u_j$, j = 1, 2 (each
ruled by a corresponding uniform random variable $U_j$) if it falls below a given
threshold $p_j$, label 0 otherwise. Moreover we give to the point $a_i$ of coordinates
$(u_1, u_2)$ a label equal to the product of the labels of the single coordinates.
Thus the probability $p_c$ that a point a falls in the open rectangle c bounded
by the coordinate axes and the two mentioned threshold lines (for short we will
henceforth refer to these rectangles as b_rectangles), receiving label 1, is $p_1 p_2$. Let
us complicate the inference scheme in the two ways mentioned in the previous
section.

1. We move from $U_j$ to the family of uniform random variables $Z_j$ in $[0, \vartheta]$
explained by the function $z = \vartheta u$ with $\vartheta \in (0, +\infty)$.

2. We maintain the same labeling rule but do not know c, i.e. the thresholds
$p_1$ and $p_2$. Rather, within the class of b_rectangles containing all
1-labeled sample points yet excluding all 0-labeled sample points (consistent
b_rectangles as statistics), we will identify it with a maximal one (call
it h), i.e. one that cannot be included in another element of the class
(hence having edges just before a pair of closest 0-labeled points). Letting
$\hat p_1$ and $\hat p_2$ be the lengths of these edges, we presently look for the probability
$p_h = \hat p_1 \hat p_2 / \vartheta^2$ representing the asymptotic frequency with which future
points (generated with the above sampling mechanism for any $\vartheta$) will fall
in h.

A product of two Bernoulli variables (hence a Bernoulli variable as well), but:

an intermediate sampling mechanism generates the variables, and the thresholds (hence c) are unknown.
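
The text does not prescribe how a maximal consistent b_rectangle is computed. One
possible sketch is a greedy procedure: start from the smallest box holding all
1-labeled points and push each edge outward, one axis at a time, until it meets the
first 0-labeled point that would otherwise enter the box. The function names, the
`upper` bound and the toy sample are our own assumptions; other maximal rectangles
may exist for the same sample.

```python
import math

def maximal_b_rectangle(points, labels, upper=math.inf):
    """Greedy sketch: edge lengths of one maximal consistent b_rectangle.

    The rectangle is the box [0, e_1) x ... x [0, e_n); `upper` is the edge
    assigned to an axis along which no 0-labeled point blocks the expansion.
    """
    n = len(points[0])
    positives = [p for p, b in zip(points, labels) if b == 1]
    negatives = [p for p, b in zip(points, labels) if b == 0]
    # Smallest closed box [0, low_j] holding all 1-labeled points
    low = [max((p[j] for p in positives), default=0.0) for j in range(n)]
    edges = [None] * n
    for j in range(n):
        def inside_elsewhere(q):
            # q lies within the current box on every axis other than j
            for i in range(n):
                if i == j:
                    continue
                if edges[i] is not None:
                    if q[i] >= edges[i]:
                        return False
                elif q[i] > low[i]:
                    return False
            return True
        edge = min((q[j] for q in negatives if inside_elsewhere(q)), default=upper)
        if edge <= low[j]:
            raise ValueError("labels are not consistent with any b_rectangle")
        edges[j] = edge
    return edges

def contains(edges, point):
    """Membership in the box [0, e_1) x ... x [0, e_n)."""
    return all(x < e for x, e in zip(point, edges))

points = [(0.2, 0.3), (0.4, 0.1), (0.6, 0.2), (0.3, 0.7), (0.9, 0.9)]
labels = [1, 1, 0, 0, 0]
edges = maximal_b_rectangle(points, labels)
```

In this toy sample the two edges stop at the 0-labeled points (0.6, 0.2) and
(0.3, 0.7), which play the role of the points bounding the rectangle in the
discussion that follows.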


To exploit the above arguments about the abstract B(h) sequence we must
realize that this sequence is actually pivoted around the hypothesis h we have
built through our algorithm. Said in other words, we may imagine a whole
family of B sequences, each pivoted on a possible rectangle. In some sequences
the domain $B_{\tilde p}$ of measure $\tilde p$ will include the pivot; in others it is included by it.
Thus we need witnesses that, for our actual h computed from the actual sample,
the pivot is included in $B_{\tilde p}$. But this happens if two special points (in Fig. 3.3(a),
exactly the marked negative one preventing the rectangle from expanding on the
left and the marked negative one preventing it from expanding upward)
are included in $B_{\tilde p}$. Indeed, only one point is necessary if we already know how
the sequence is pivoted around h, as happens for h being a circle with the same
center, like in the sequence in Fig. 3.2. In our case, having one point binding the
vertical edge of h is not sufficient for ensuring that the horizontal edge of h does not
trespass the $B_{\tilde p}$ contour. Thus let us enrich the family of sequences, having for
each rectangle and each possible pair of witness points the pivot constituted by
the union of the rectangle with these points. With respect to the sequence pivoted
on our actual rectangle and witnesses, we have that if 2 or more negative points
are included in $B_{\tilde p}$, for sure the witness points are among them. Hence a twisting
argument reads:

$$(p_h < \tilde p) \Leftarrow (k_{\tilde p} \ge k + 2) \qquad (3.2)$$

where k is the number of 1-labeled points and $k_{\tilde p}$ is still a specification of a Binomial
random variable of parameters m (sample size) and $\tilde p$, accounting for the sample
points contained in $B_{\tilde p}$.

Witnesses of a ... topology

5We will use this figure both as a true bi-dimensional instance and as a projection from a
four-dimensional space.

From the left, let us consider the family of b_rectangles. As $\vartheta$ is free, $(p_h < \tilde p)$
requires that there must be an enlargement of h whose measure for a proper $\vartheta$
exactly equals $\tilde p$. But since both edges of h are bounded by a negative point,
this enlargement must contain at least one point more than h itself. Formally

$$(k_{\tilde p} \ge k + 1) \Leftarrow (p_h < \tilde p) \qquad (3.3)$$

Putting together the two pieces of twisting argument, we obtain the corresponding
bounds on probabilities as follows:

$$P(K_{\tilde p} \ge k + 1) \ge P(P_h < \tilde p) = F_{P_h}(\tilde p) \ge P(K_{\tilde p} \ge k + 2) \qquad (3.4)$$


Generalizing our arguments to an n-dimensional space, the results of Sec. 2.2
extend to the following:


Lemma 3.1. Let $\mathfrak{Y} = \mathbb{R}^n$, and let $((Y_i, X_i), i = 1, \dots, m, \dots)$ be an infinite sequence of
pairs of random variables drawn from the Cartesian product space $\mathfrak{Y} \times \{0,1\}$.
Let us assume that for each specification of the sequence a b_rectangle c exists
in $\mathfrak{Y}$ such that

$$x_i = \begin{cases} 1 & \text{if } y_i \in c \\ 0 & \text{otherwise} \end{cases} \qquad (3.5)$$

for each i, and let h denote a maximal b_rectangle with respect to the sample
constituted by the first m elements in the sequence. For $k = \sum_{i=1}^{m} x_i$, a symmetric
confidence interval of level $\delta$ for $P_h = P(Y \in h)$ is $(l_i, l_s)$, where $l_i$ is the $\delta/2$
quantile of the Beta distribution of parameters k + 1 and m − k, and $l_s$ is the
analogous $1 - \delta/2$ quantile for parameters k + n and m − k − n + 1.


Proof. It is easy to realize that in this more general case at most n witness
points are necessary, for whatever probability distribution on $\mathfrak{Y}$ ⁶: so we just
rewrite (3.4) in terms of a Beta distribution law after substituting 2 with n in
the right implication threshold:

$$I_\beta(k + 1, m - k) \ge F_{P_h}(\beta) \ge I_\beta(k + n, m - k - n + 1) \qquad (3.6)$$

and use the same arguments of Sec. 2.2. □
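
The interval of Lemma 3.1 can be computed with elementary tools. The sketch below
relies on the standard identity $I_x(a, b) = P(K \ge a)$ for a Binomial variable K of
parameters a + b − 1 and x, valid for positive integers a and b, and inverts it by
bisection. The function names and the numerical case (k = 10, m = 30, n = 4,
δ = 0.1) are illustrative assumptions of ours.

```python
from math import comb

def binomial_upper_tail(m, p, j):
    """P(K >= j) for K ~ Binomial(m, p)."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(j, m + 1))

def beta_quantile(a, b, q, tol=1e-12):
    """q-quantile of Beta(a, b) for positive integers a, b, found by bisection."""
    m = a + b - 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # I_mid(a, b) equals the Binomial(m, mid) probability of at least a successes
        if binomial_upper_tail(m, mid, a) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def b_rectangle_interval(k, m, n, delta):
    """Extremes (l_i, l_s) of the interval stated in Lemma 3.1."""
    if k < 0 or n < 1 or k + n > m:
        raise ValueError("need 0 <= k and k + n <= m with n >= 1")
    lower = beta_quantile(k + 1, m - k, delta / 2)
    upper = beta_quantile(k + n, m - k - n + 1, 1 - delta / 2)
    return lower, upper

lower_4d, upper_4d = b_rectangle_interval(k=10, m=30, n=4, delta=0.1)
lower_1d, upper_1d = b_rectangle_interval(k=10, m=30, n=1, delta=0.1)
```

The lower extreme does not depend on n, while the upper extreme moves up as n grows,
in agreement with the role of the n witness points in the proof.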


6The interested reader can see a more detailed proof in [Apolloni and Chiaravalli, 1997, 
Apolloni and Malchiodi, 2001]. 


The probability increase must include at least one more sample point in at least one population.

Once again is a matter of Beta distribution

Same shape of confidence regions, but with different parameters

Controlling the measure of special domains: a key action for any learning target

Principal ingredients: a domain that depends on the sample points,



The Bernoulli variable X describing the label of questioned points in the 
previous lemma comes from the composition of two explaining functions: the 
former mapping from the random seed $(U_1, \dots, U_n)$ to the point $Y \in \mathbb{R}^n$; the
latter from Y to its label. In this paper we will refer to results concerning X
distribution that hold for whatever Y distribution. Even the number n of the 
random seed components is disregarded, as it is overridden by other complexity 
indices of the inference problem that take into account how Y components 
interact to compute the labels. Of course a unitary uniform distribution random 
variable U is not susceptible to any prior assumption. 

We extended the coverage experiment shown in Fig. 2.23 to the new framework
as follows. We consider 20 sampling histories. Each sample is made of
m = 30 labeled points of a four-dimensional continuous unitary cube. We chose
this dimension to mark the drift from the Bernoulli variable, whereas the rectangle
in Fig. 3.3(a) represents just a projection in a two dimensional space. To
consider a non flat probability distribution the sampling mechanism of the j-th 
coordinate has explaining function g(u) = uf 7. Then, to figure out a wide set 
of possible labeling mechanisms (3.5), we stored the labels attributed to sam- 
ple points from all b_rectangles within the unitary hypercube with vertices in 
a grid of edge 1/10 on each dimension. Then we forgot the source figures and 
for each labeling computed a maximal consistent b_rectangle h and we drew
the graph in Fig. 3.3(b). Namely, for each sample and each labeling according
to the above rectangles we reported on the graph the actual frequency $\phi$ and
the probability $p_h$ (analytically computed on the basis of the rectangle edges)
of having a point in $\mathfrak{Y}$ belonging to the guessed maximal hypothesis. On the
same graph we also reported the curves describing the course of the symmetric
0.900 confidence interval for $P_h$ with the observed frequency of falling inside the
rectangle h according to Lemma 3.1 with n = 4. Finally, for the sake of comparison
we also drew the curves obtained from Lemma 2.19 for a pure Bernoulli
variable. In spite of some apparent unbalancing in the figure, the percentages of
points falling out of the upper and lower bound curves are approximately equal, 3.32%
and 3.67% respectively. Thus they satisfy the 5% upper bounds (corresponding
to $\delta/2$, where $\delta$ is the confidence level of the interval) used for drawing these
curves. The analogous percentages, 16.2% and 0.28%, denote the inadequacy of
the curves drawn for the Bernoulli distribution. The smaller values of actual
versus allowed bound trespassers can be attributed to the worst-case duty of
our curves, i.e. they must guarantee a given confidence whatever the underlying
distribution law is.
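
The step of the experiment that stores the labelings induced by a lattice of
b_rectangles can be sketched as follows. We assume that the free vertex of each
rectangle ranges over the multiples of 1/10 from 0.1 to 1 on every axis, and, for
simplicity, that the coordinates are uniform in the unit hypercube rather than drawn
through the non flat explaining function used in the text; the function name
`lattice_labelings` is ours.

```python
import itertools
import random

def lattice_labelings(sample, steps=10):
    """Map each distinct labeling of the sample to one lattice b_rectangle producing it."""
    n = len(sample[0])
    grid = [(i + 1) / steps for i in range(steps)]
    labelings = {}
    for corner in itertools.product(grid, repeat=n):
        # A point gets label 1 when it lies below the free vertex on every axis
        labels = tuple(int(all(x < t for x, t in zip(point, corner)))
                       for point in sample)
        labelings.setdefault(labels, corner)
    return labelings

rng = random.Random(0)
sample = [tuple(rng.random() for _ in range(4)) for _ in range(30)]
labelings = lattice_labelings(sample)
```

Each stored labeling can then be handed, without its source rectangle, to whatever
procedure computes a maximal consistent hypothesis.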

The above check on domain measures is the key action of any learning task 
in our framework, where confidence intervals like the ones in Fig. 3.3(b) are the 
ultimate probabilistic learning target. Indeed, the distinguishing features of the 
above study case are the following: 


• we are building a domain h on the basis of the sampled coordinates (besides
the bias that the rectangle left-lower vertex coincides with the origin
of the axes and edges have the same directions and orientations as the axes);


• though coming from independent $U_i$'s, some sample points, those labeled
by 1, share the fact of being all inside h, and those labeled by 0 vice versa;


• the above twisting argument holds whatever the coordinates' joint distribution
law is.


7Hence we adopt a different sampling mechanism than the one used for simply introducing
our first learning problem. This is not a drawback as the claim of Lemma 3.1 is distribution-free.


Fig. 3.3: Generating 0.900 confidence intervals for the probability $P_h$ of a b_rectangle
in $\mathfrak{Y} = [0, 1]^4$ from a sample of 30 elements.

(a) The drawn sample and one of its possible labelings in a two-dimensional projection;
points inside circles are deputed sentinels. (b) Points: courses of the sample frequency
$\phi$ and probability $p_h$ of falling inside a b_rectangle with a lattice of labeling functions.
Plain curves: trajectories described by the confidence interval extremes for the actual
4-dimensional case. Dashed curves: trajectories described by the confidence interval
extremes for the hypothetical 1-dimensional case.


The experiment in the figure confirms that the sole probabilistic consequence of
these additional features in comparison to the original case study of Fig. 2.23
is in the bounds of the confidence region. Now they are pushed up by the fact
that four points in place of one need to witness the inclusion of h in a proper
domain of measure $\tilde p$ and at least one is additionally included in an enlargement
of h with this measure.


3.1.3 The PAC learning goal 


Let us move now to the formal statement of the Probably Approximately Correct
(for short, PAC) learning problem, today broadly identified with statistical
learning theory.


Definition 3.1. For a given space $\mathfrak{X}$, a concept is a Boolean function $c : \mathfrak{X} \mapsto \{0,1\}$.
A set of concepts defines a concept class C.


sample points functionally linked,

logical relations independent of the sample distribution law.

A region of a space,

a set of points partly inside partly outside it,

a satisfactorily approximate guess about it.

Approximation measure: the measure of the symmetric difference

A learning algorithm computes hypotheses with any sample, up to some approximation



By abuse of notation we will refer to a concept either as a function c or as the
support of it ⁸.


Definition 3.2. Having fixed a space $\mathfrak{X}$, consider a random variable X and a
concept c on it. An example Z of c is a pair (X, c(X)). This means that an
example associates to an $x \in \mathfrak{X}$ a label c(x) whose value is 1 if x belongs to
c and 0 otherwise. Vice versa, a function $h : \mathfrak{X} \mapsto \{0,1\}$ is consistent with an
example z = (x, c(x)) of c if h(x) = c(x). The function h is consistent with a
set of examples if it is consistent with all its elements.
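
Definition 3.2 translates almost literally into code. In the sketch below concepts
and hypotheses are Python functions returning 0 or 1; the b_rectangle concept, its
edge values and the input points are hypothetical choices made for illustration.

```python
def is_consistent(h, examples):
    """True if h(x) equals the label for every example (x, label)."""
    return all(h(x) == label for x, label in examples)

def b_rectangle(edges):
    """Concept given by the box [0, e_1) x ... x [0, e_n) (assumed form)."""
    return lambda x: int(all(0 <= xi < e for xi, e in zip(x, edges)))

c = b_rectangle((0.6, 0.7))
inputs = [(0.2, 0.3), (0.4, 0.1), (0.65, 0.2), (0.3, 0.9)]
examples = [(x, c(x)) for x in inputs]   # pairs (x, c(x))

h_wide = b_rectangle((0.65, 0.9))        # stops right at the two 0-labeled points
h_too_wide = b_rectangle((0.7, 0.9))     # swallows the 0-labeled point (0.65, 0.2)
```

Here `h_wide` differs from c as a function and yet is consistent with the examples,
whereas `h_too_wide` contradicts one of them.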


Given $\mathfrak{X}$ and general properties that define a class C of concepts on it, we consider
the learning task of inferring from a set of examples the concept c in C underlying
their labels. Actually we are satisfied with identifying, within a set H, a concept h
that we call hypothesis and that approximates c well. As mentioned before, to appreciate
how close h is to c we refer to the probability distribution on $\mathfrak{X}$. We do not
know it explicitly, but we observe the examples, and we will be questioned about
c(x) on new inputs drawn with a compatible distribution. Our answer will be wrong on
those points that either belong to c and not to h (thus we will answer 0 using
h(x) while the correct answer is c(x) = 1) or belong to h and not to c (and
our answer will be h(x) = 1 in place of c(x) = 0). The set of these points
constitutes the symmetric difference c + h between c and h, and we quantify the
approximating capability of h through the probability P(c + h) of this region,
usually called error probability $P_{error}$. Note that the symmetric difference in
the example of the bounded rectangle may coincide with h − c since we draw
a maximal hypothesis consistent with the examples, a hypothesis that is not
guaranteed, however, to include c.
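To make the error probability concrete, the following sketch estimates P(c + h) by counting disagreements between c and h on freshly drawn points. The two concentric circles and the uniform law on the unit square are assumptions chosen only for illustration; in the learning setting the distribution is unknown.

```python
import random

def in_circle(p, center, radius):
    """Membership test for a closed disc in the plane."""
    return (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 <= radius ** 2

def estimate_error(c, h, draw, n=100_000, seed=0):
    """Monte Carlo estimate of the measure of the symmetric difference of c and h."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n):
        x = draw(rng)
        wrong += c(x) != h(x)   # x falls in c + h exactly when the labels differ
    return wrong / n

# Hypothetical target and hypothesis: concentric circles in the unit square.
c = lambda p: in_circle(p, (0.5, 0.5), 0.30)
h = lambda p: in_circle(p, (0.5, 0.5), 0.25)
uniform_square = lambda rng: (rng.random(), rng.random())

p_error = estimate_error(c, h, uniform_square)
```

Here c + h is the annulus between the two circles, whose area is π(0.30² − 0.25²) ≈ 0.086, so the estimate should land close to that value.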

The formal statement of a learning task in our inference framework is the 
following: 


Definition 3.3. For fixed $m \in \mathbb{N}$, a labeled sample is a set

$$z_m = \{(x_i, b_i),\ i = 1, \ldots, m\}$$

where $b_i$ are Boolean variables and $x_i \in \mathfrak{X}$. Given a $z_m$ and a concept class C,
assume that for every $M \in \mathbb{N}$ and every (labeled) population $z_M$ generated with
the same sampling mechanism as $z_m$ a c exists in C explaining both sample
and population labels, i.e. such that $z_{m+M} = \{(x_i, c(x_i)),\ i = 1, \ldots, m+M\}$.
For a set $\mathfrak{Z}_m$ of $z_m$'s sharing C, we call learning algorithm an algorithm
$\mathcal{A}: \mathfrak{Z}_m \mapsto H$, where H is a concept class as well, which

• for any $z_m \in \mathfrak{Z}_m$;

• for some pair⁹ of parameters $\varepsilon, \delta > 0$ (denoted henceforth as accuracy
parameters)

computes another function, that we denote as hypothesis h, such that the
confidence interval $[0, \varepsilon]$ for the measure $P_{error}$ of the symmetric difference c + h
between the two functions has at least confidence $1 - \delta$ (see Fig. 3.1). In
formulas:

$$P(P_{error} \le \varepsilon) \ge 1 - \delta \qquad (3.7)$$

⁸ i.e. the set of points $x \in \mathfrak{X}$ such that c(x) = 1.
⁹ Elsewhere a tighter request is that for every $(\varepsilon, \delta)$ an $m_0$ exists such that for every $m > m_0$
the claim of the definition holds.


Remark 3.1. With this notation, h is a statistic from the labeled sample, and
c is a specification of a random function labeling the suffixes of $z_m$, playing
roles analogous to those of the statistic and of the parameter P, respectively, in
the Bernoulli variable. Henceforth we prefer not to refer in the analytical
expressions to the random function, to avoid confusion between its symbol C
(according to our notation) and the C denoting the concept class. Rather we will
encapsulate its randomness directly in the functions of this parameter we will
manage. For instance $P_{error}$ will be denoted as $U_{c+h}$ to mean that it measures
c + h and is a random variable (capital letter) depending on the suffix of $z_m$
that c must explain (in Chapter 5 we will follow another notational strategy).


3.1.4 Sentry Points 


Like in the rectangle case study, we relate the $P_{error}$ measure of the symmetric
difference c + h to the number of points falling inside it plus those outside it
necessary for witnessing the inclusion of c + h into a domain of measure $\varepsilon$ (for
short $B_\varepsilon$) for a given, possibly very small, $\varepsilon$. The latter points are exactly what
keeps our learning algorithm from computing an h such that c + h expands
outside $B_\varepsilon$. With reference to a generic concept class C, we dually characterize
the functionality of these points, which we call either sentinels or sentry points,
in terms of preventing a given concept c from being fully included (invaded)
by another concept within the class. Thus, like the lord of a fortress, I send
sentinels out of the walls to alert me of an invasion. They are located in such sites
that, given his own topological constraints, the invader may never elude them
all in surrounding the town walls. Namely, they are assigned by a sentineling
function S to each concept of a class in such a way that:

• they are external to the concept c to be sentineled and internal to at least
one other including it,

• each concept c' including c has at least one of the sentry points of c either
in the gap between c and c' or outside of c' and distinct from the sentry
points of c', and

• they constitute a minimal set with these properties.




Henceforth we will deal with two categories of concepts: a general one made of
concepts c, for which we will give a formal definition and further properties on
sentry points; and a specific one of our learning problem, actually constituted
by symmetric differences c + h between target concept and approximating
hypotheses. Thus from time to time we must map {c + h} onto a proper {c} to apply
the properties of sentry points. The technical functionality of the sentry points is
to show that any $c + h' \supset c + h$ is inconsistent with them¹⁰. Thus, the points are
the output of a sentry function with the following properties:


Definition 3.4. For a concept class C on a space $\mathfrak{X}$, a sentry function is a total
function¹¹ $S: C \cup \{\emptyset, \mathfrak{X}\} \mapsto 2^{\mathfrak{X}}$ satisfying the conditions

(i) Sentinels are outside the sentineled concept ($c \cap S(c) = \emptyset$ for all $c \in C$).

(ii) Sentinels are inside the invading concept (Having introduced the sets
$c^+ = c \cup S(c)$, an invading concept $c' \in C$ is such that $c' \not\subseteq c$ and
$c^+ \subseteq (c')^+$. Denoting up(c) the set of concepts invading c, we must have
that if $c_2 \in up(c_1)$, then $c_2 \cap S(c_1) \neq \emptyset$).

(iii) S(c) is a minimal set with the above properties (No $S' \neq S$ exists satisfying
(i) and (ii) and having the property that $S'(c) \subseteq S(c)$ for every $c \in C$).

(iv) Sentinels are honest guardians. It may be that $c \subseteq (c')^+$ but $S(c) \cap c' = \emptyset$
so that $c' \notin up(c)$. This however must be a consequence of the fact
that all points of S(c) are involved in really sentineling c against other
concepts in up(c) and not just in avoiding inclusion of $c^+$ by $(c')^+$. Thus
if we remove c', S(c) remains unchanged (Whenever $c_1$ and $c_2$ are such
that $c_1 \subset c_2 \cup S(c_2)$ and $c_2 \cap S(c_1) = \emptyset$, then the restriction of S to
$\{c_1\} \cup up(c_1) - \{c_2\}$ is a sentry function on this set).


S(c) is the frontier of c upon S and its elements are called either sentry points
or sentinels.

With reference to Fig. 3.4, $\{x_1, x_2, x_3\}$ is a candidate frontier of $c_0$ against
$c_1, c_2, c_3, c_4$. All points are in the gap between a $c_i$ and $c_0$. They avoid inclusion
of $c_0 \cup \{x_1, x_2, x_3\}$ in $c_3$, provided that these points are not used by the latter for
sentineling itself against other concepts. Vice versa we expect that $c_1$ uses $x_1$
and $x_3$ as its own sentinels, $c_2$ uses $x_2$ and $x_3$ and $c_4$ uses $x_1$ and $x_2$ analogously.
Point $x_4$ is not allowed as a $c_0$ sentry point since, like some diplomatic seat, a
sentry point should be located outside any other concept, just to avoid its being
occupied in case of $c_0$ invasion.

The frontier size of the most expensive concept to be sentineled with the
least efficient sentineling function, i.e. the quantity $D_C = \sup_{S,c} \#S(c)$, is called
detail of C, where S spans also over sentry functions on subsets of $\mathfrak{X}$ sentineling


¹⁰ i.e. that at least one sentry point $x_j$ falls in c + h'.

¹¹ As usual in theoretical computer science a function $f: A \mapsto B$ does not need to be defined
for each element of A, to take into account never-ending computations such as infinite loops.
When on the contrary a function is defined on every element of its domain, it is said to be
total.


Fig. 3.4: A schematic outlook of outer sentineling functionality: $c_0$ concept sentineled
against $c_1, c_2, c_3, c_4$ through $\{x_1, x_2, x_3\}$. $x_1$ and $x_3$ are used as its own sentry points
also by $c_1$, and $x_2$ prevents $c_1$ from invading $c_0$; $x_2$ and $x_3$ are sentry points also of $c_2$, and
$x_1$ prevents invasion of $c_0$ by $c_2$; and $x_1$ and $x_2$ are sentry points also of $c_4$ and $x_3$
prevents invasion; $c_3$ does not include $c_0 \cup S(c_0)$, hence it cannot invade $c_0$. $x_4$ is a
useless point. All marked points are negative examples of $c_0$.


in this case the intersections of the concepts with these subsets. Actually, proper
subsets of $\mathfrak{X}$ may host sentineling tasks that prove harder than those arising with
$\mathfrak{X}$ itself.

Moving to symmetric differences, for another set H of concepts let us
consider the class of symmetric differences $c + H = \{c + h\ \forall h \in H\}$ for any c belonging
to C. The detail of a concept class H w.r.t. c is the quantity $D_{c,H} = D_{c+H}$. Vice
versa, for a fixed h we consider an augmented sentineling functionality against
the entire class C, where to each h is assigned a minimal set of points
necessary to sentinel any c + h, for any $c \in C$, against c + H. Let us denote by
$D_{(C,H)_h}$ the cardinality of this sentry set, and define the overall detail of the
class $C + H = \bigcup_{c \in C} c + H$ as the quantity $D_{C,H} = \sup_{h \in H} \{D_{(C,H)_h}\}$.


A given concept class C might admit more than one sentry function. As
in the case of other class complexity indices, such as the Vapnik-Chervonenkis
dimension that we will consider later on, the detail of a class is a parameter
that is extremely meaningful yet difficult to compute. Here below we give some
elementary examples, noting that for more complex classes the computation of
D remains an open problem. The interested reader can find further examples
on this matter, plus details on all related results discussed in this chapter, in
a set of references we will list in the last section.


Example 3.1 (Discrete spaces). 


• Consider the class $C = \{c_1, c_2, c_3, c_4\}$ on $\mathfrak{X} = \{x_1, x_2, x_3\}$ whose concepts
are illustrated in the following scheme, where “+” denotes an element $x_j$
belonging to $c_i$, “−” an element outside $c_i$ and ⊙ a sentry point:

            x1   x2   x3
    c1 =    ⊙    ⊙    −
    c2 =    ⊙    +    +
    c3 =    +    ⊙    +
    c4 =    +    +    +

This class has $D_C = 2$. As usual we may have different sentineling
functions. A worst case S, as illustrated, is: $S(c_1) = \{x_1, x_2\}$, $S(c_2) = \{x_1\}$,
$S(c_3) = \{x_2\}$, $S(c_4) = \emptyset$. However a cheaper one is $S(c_1) = \{x_3\}$,
$S(c_2) = \{x_1\}$, $S(c_3) = \{x_2\}$, $S(c_4) = \emptyset$:

            x1   x2   x3
    c1 =    −    −    ⊙
    c2 =    ⊙    +    +
    c3 =    +    ⊙    +
    c4 =    +    +    +


• Let the class of concepts be constituted by

            x1   x2   x3
    c1      ⊙    −    −
    c2      +    ⊙    −
    c3      +    +    ⊙
    c4      +    +    +

where as above $\mathfrak{X} = \{x_1, x_2, x_3\}$, $C = \{c_1, c_2, c_3, c_4\}$. A sentry function
is: $S(c_1) = \{x_1\}$, $S(c_2) = \{x_2\}$, $S(c_3) = \{x_3\}$ and $S(c_4) = \emptyset$. Note that,
according to requirement (iv), the set $\{x_2, x_3\}$ is not feasible as a frontier
for $c_1$ because otherwise it would be $c_1 \subset c_2 \cup S(c_2)$ and $c_2 \cap S(c_1) = \emptyset$. But the
restriction of S to $C' = \{c_1\} \cup up(c_1) - \{c_2\} = \{c_1, c_3, c_4\}$ would be

            x1   x2   x3
    c1      −    ⊙    ⊙
    c3      +    +    ⊙
    c4      +    +    +

which would violate minimality requirement (iii): in fact $S'(c_1) = \{x_2\}$,
$S'(c_3) = \{x_3\}$, $S'(c_4) = \emptyset$ are contained in the frontiers assigned by S and
are frontiers as well.


• The concepts represented in Fig. 3.5(a), with the same notation as in the
previous point, are in no inclusion relation with one another, hence they
constitute a class C with detail 0 (actually 1 if we add the whole set and the
empty set as concepts, according to Definition 3.4). However they identify
with the class in Fig. 3.5(b) if the last four components of $\mathfrak{X}$ are hidden,
thus requiring two sentry points in the worst case as in the figure.
Therefore $D_C = 2$ (but a cheap sentry function is: $S(c_1) = \{x_3\}$, $S(c_2) = \{x_1\}$,
$S(c_3) = \{x_2\}$ and $S(c_4) = \emptyset$).

[Figure 3.5: membership tables of $c_1, \ldots, c_4$, (a) over $x_1, \ldots, x_7$ and (b) over $x_1, x_2, x_3$.]

Fig. 3.5: (a) A class of concepts without any inclusion relation between its elements.
(b) Their intersections with subspace $\{x_1, x_2, x_3\}$ need two sentry points in the worst
case.


• The class of monotone clauses [Wegener, 1987] has detail 1. Each clause
$(v_{i_1} + v_{i_2} + \ldots + v_{i_k})$, where the v's are never negated propositional variables, is
sentineled by the point x of the Boolean hypercube whose i-th component
is 0 if $v_i$ is an addend in the clause, 1 elsewhere.
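The discrete examples above can be checked mechanically. The sketch below is an illustration under simplifying assumptions: it encodes concepts as frozensets of point indices, takes the first class of Example 3.1 as read from its schemes, and tests only conditions (i) and (ii) of Definition 3.4. Minimality (iii) and honesty (iv) are not verified, and the helper names are hypothetical.

```python
def up(c, concepts, S):
    """Concepts invading c under S: d is not contained in c and c+ lies within d+."""
    c_plus = c | S[c]
    return [d for d in concepts if not d <= c and c_plus <= (d | S[d])]

def satisfies_i_and_ii(concepts, S):
    """Check conditions (i) and (ii) of Definition 3.4 on a finite class."""
    for c in concepts:
        if c & S[c]:                  # (i): sentinels lie outside c
            return False
        for d in up(c, concepts, S):
            if not (d & S[c]):        # (ii): every invader contains a sentinel of c
                return False
    return True

def largest_frontier(S):
    """Size of the most expensive frontier assigned by S."""
    return max(len(points) for points in S.values())

f = frozenset
# First class of Example 3.1 on {x1, x2, x3}, points encoded as 1, 2, 3.
c1, c2, c3, c4 = f(), f({2, 3}), f({1, 3}), f({1, 2, 3})
concepts = [c1, c2, c3, c4]
worst = {c1: f({1, 2}), c2: f({1}), c3: f({2}), c4: f()}
cheap = {c1: f({3}), c2: f({1}), c3: f({2}), c4: f()}
broken = {c1: f(), c2: f({1}), c3: f({2}), c4: f()}  # c1 left unguarded
```

Both `worst` and `cheap` pass the check, with largest frontiers of size 2 and 1 respectively, while `broken` fails because nothing witnesses the invasion of `c1`.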


Example 3.2 (Continuous spaces). 


• Class C of circles in $\mathbb{R}^2$ has detail $D_C = 2$, as evident in Fig. 3.6(a). Note
that the class we want to sentinel in the task in Fig. 3.1 is constituted
by symmetric differences between circles. In this respect examples are all
negative points; some of these will act as sentry points. A worst case is
when either $c \subseteq h$ or vice versa. In both cases it is a matter of sentineling
a circle again, thus also the detail of this class equals 2. Apparently,
symmetric differences like in Fig. 3.1 require only one point to sentinel any
inclusion by a concept made of the symmetric difference of h with a c
(for instance, one of the two points in Fig. 3.19), since to have inclusion
c must cross h in the same points as $c_0$ does. However, since we do not
know c, we cannot exclude that the point we selected as a sentry falls
exactly in the intersection of c with h. In this case we need a second sentry
point. A third point is unnecessary, since having two points exactly on the
intersections of c with h and a third in another position on the border of the
circle has the same effect as having only one point in an intersection and
another elsewhere: the former sentinels c + h if c crosses h in the latter
and vice versa.

• As seen in Sec. 3.1.2, the detail of the class of b-rectangles in $\mathbb{R}^n$ is n.
Analogously, the detail is 2n if we consider the class of intervals in $\mathbb{R}^n$.




Fig. 3.6: (a) Two points x1, x2 outside c (thick circle) are sufficient to prevent a larger 
circle not containing them from including it. (b) Seven points are sufficient to sentinel 
a pentagon in the worst case. 


• $D_C$ is $\lceil \frac{4}{3} k \rceil$¹² if the concepts are convex polygons on $\mathbb{R}^2$ having exactly
k edges (Fig. 3.6(b) shows a worst case frontier for the case k = 5; see
[Apolloni and Malchiodi, 2001] for the general case proof). The class C + C
of the symmetric differences between the above polygons has detail $\lceil \frac{4}{3} k \rceil$
as well (see Lemma 3.5 later in this chapter).


• $D_C$ is 1 if the elements of C are oriented half lines on $\mathbb{R}$ (see Fig. 3.11 later
on), and 2 if they are segments on $\mathbb{R}$. $D_C$ is n if elements of C are oriented
half-spaces in $\mathbb{R}^n$. Let C be the concept class of half-spaces defined on
$\mathbb{R}^n$. With H = C, $D_{C,H} = n$. Indeed, let us identify by abuse concepts
with the oriented hyperplanes bounding them. Sentineling the expansion
of the symmetric difference c + h results in forbidding any rotation of
h into an h' pivoted along the intersection of c with h. Whatever the
dimensionality n of the embedding space, in principle we would need only
1 sentry point, provided we know the target concept c. In fact constraining
h' to contain the intersection of h with c gives rise to up to n − 1 linear
relations on the coefficients of h', resulting, in any case, in a single degree of
freedom for the coefficients of h', i.e. a single sentry point. However, as we do
not know c, the chosen sentry point may lie exactly in the intersection
between c and h, which prevents it from sentineling the expansion of the
symmetric difference. So we need one more sentry point, in analogy with
what was discussed about the concept class of circles in the previous point, and in
general, as many points as the dimensionality of the space. Figs. 3.7(a)
and (b) illustrate this concept in case n = 2 and n = 3 respectively. As
shown in Fig. 3.7(a), to ensure that the sentry points witness the expansion
of the symmetric difference (gray region), we must put a new sentry
point $s_2$ beside $s_1$, to consider the case in which $s_1$ lies in the intersection


¹² where $\lceil a \rceil$ denotes the ceiling of a, i.e. the smallest integer greater than or equal to a.


Fig. 3.7: Sentry points in the worst case needed to sentinel the symmetric difference 
between the hypothesis h and the target concept c in (a) two-dimensional and (b) 
three-dimensional space. 


between c and h. The same happens in a three-dimensional space, as
shown in Fig. 3.7(b). Here both $s_1$ and $s_2$ may lie on the straight line
resulting from the intersection of h with c. Therefore we need a new point
$s_3$ to avoid any rotation of h into an h' with a high symmetric difference.
Note that in the general case,

1. n − 1 points non linearly related are necessary to identify the
intersection between two hyperplanes in an n-dimensional Euclidean space.
Hence,

2. since we must sentinel any expansion of the symmetric difference c + h
for whatever unknown c, a worst case sentry is constituted by n non
linearly related points, n − 1 of which lie on the intersection of c with h
and the remaining one binds the rotation of h.

• Consider the class C whose elements can be expressed as the union of an
arbitrary number of disjoint intervals in $\mathbb{R}$. Sentineling one item $c \in C$
demands sentineling each interval of which c is composed. As there is no
bound on the number of these intervals, C has detail $D_C = +\infty$.


3.1.5 A twisting argument for learning 


Thanks to the parameter D defined in the above subsection we can state the
following twisting argument. Denote by $u_{c+h}$ the probability measure of the
symmetric difference between a candidate concept and its approximating
hypothesis, i.e. the parameter $P_{error}$ whose distribution law we want to infer. The




related statistic $k_{u_{c+h}}$ counts the number of actual sample points falling in c + h,
and $k_\varepsilon$ the analogous number for an enlargement of c + h of measure $\varepsilon$ (for short
an $\varepsilon$-enlargement). By definition the sentry points within the labeled sample
witness the inclusion of c + h in $B_\varepsilon$ as in (3.2). Thus we extend (3.2) and (3.3)
as follows:

$$(k_\varepsilon \ge k_{u_{c+h}} + 1) \Leftarrow (u_{c+h} < \varepsilon) \Leftarrow (k_\varepsilon \ge k_{u_{c+h}} + D_{C,H}) \qquad (3.8)$$

These implications are distribution-free in the sense that they hold for whatever
distribution law will describe the suffix of the observed sample. Namely, the right
implication is a matter of topological expansion not affected by this distribution
law. The left implication says there must be a distribution law on $\mathfrak{X}$ filling up
with a probability mass $\varepsilon$ an algorithmic expansion of the symmetric difference
violating at least one sentry point. This distribution belongs to the family of the
actual suffix distributions, but it exists no matter what the latter is, if we have
no special constraint on the family (see Sec. 3.3.1).

If in particular h is a consistent hypothesis, then $k_{u_{c+h}} = 0$. The number of
points inside the questioned domain is the sole difference between the present
learning task and the inference task in Fig. 3.3. In both cases we look for a
maximal consistent domain for stating a worst case right implication in the twisting
argument. The left implication refers to any consistent hypothesis computed by a
learning algorithm. The companion probability chain reads¹³:

$$I_\varepsilon(1, m) \ge F_{U_{c+h}}(\varepsilon) \ge I_\varepsilon(D_{C,H}, m - D_{C,H} + 1) \qquad (3.9)$$
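The two bounds of the probability chain can be evaluated numerically. The sketch below relies on the standard identity between the incomplete Beta function and the Binomial tail, $I_\varepsilon(h, m - h + 1) = P(K \ge h)$ with K Binomial of parameters m and $\varepsilon$; the function names and the numeric values of $\varepsilon$, m and the detail are assumptions for illustration.

```python
from math import comb

def binomial_tail(eps, h, m):
    """P(K >= h) for K ~ Binomial(m, eps), which equals I_eps(h, m - h + 1)."""
    return sum(comb(m, i) * eps ** i * (1 - eps) ** (m - i) for i in range(h, m + 1))

def error_cdf_bounds(eps, m, detail):
    """Lower and upper bound on the c.d.f. of the error measure at eps."""
    upper = binomial_tail(eps, 1, m)        # I_eps(1, m)
    lower = binomial_tail(eps, detail, m)   # I_eps(D, m - D + 1)
    return lower, upper

# Illustrative values: 50 examples, accuracy 0.1, detail 2.
lower, upper = error_cdf_bounds(0.1, 50, 2)
```

With these values the upper bound is $1 - 0.9^{50}$ and the lower bound additionally subtracts the probability of exactly one point falling in the $\varepsilon$-enlargement; the gap between the two widens as the detail grows.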

In stating (3.9) we understood two reasonable properties of the learning
algorithm. Namely, on the one hand we require that the algorithm computes
a hypothesis also when a part of the sample space is blinded by the sampling
mechanism, so we can rely on examples coming only from a subspace of $\mathfrak{X}$.
This is formalized through a strong surjectivity property. On the other hand
we look for non-captious functions, so that the symmetric difference between
concept and hypothesis expands monotonically as the subset of sample points
included in it enlarges.


¹³ See [Apolloni and Chiaravalli, 1997] for a more direct proof of this statement.

Definition 3.5. Given a concept class C on a space $\mathfrak{X}$ and the specification $z_m$
of a sample drawn from the corresponding probability space, for any $W \subseteq \mathfrak{X}$
and $B \subseteq 2^{\mathfrak{X}}$ let us denote

• by $B_W$ the quotient set of B with respect to the equivalence relation on
the subsets of $\mathfrak{X}$ defined by having the same intersection with W (i.e.
two concepts of B having the same intersection with W represent a single
concept in this subspace);

• by $W_m$ the set of labeled points from $z_m$ having all the components inside
W (i.e. sample points can be drawn only from W).

A function $\mathcal{A}: \{z_m\} \mapsto C$ is strongly surjective if for each subset Y of the sample
space, $\mathcal{A}$ is a surjection from $Y_m$ to $C_Y$ (i.e. it computes a hypothesis even when
the observed points are so confined that they do not allow identifying any specific
hypothesis; in this case $\mathcal{A}$ computes an element of the above equivalence class
of hypotheses). A function as above is fairly strongly surjective if, in addition,
for any given $z_m$ and $h = \mathcal{A}(z_m)$ partitioning the example
set into the set $z_h^+$ of those belonging to h and its complement, any h' such that
$z_h^+ \subseteq z_{h'}^+$ includes h. Vice versa, any hypothesis $h' \in C$ including h will contain
a superset of $z_h^+$ (i.e. the more sample points there are inside h the more population
points there are and vice versa).
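The two objects introduced by Definition 3.5 are easy to compute on a finite space. In the following sketch the class B, the subspace W and the sample are hypothetical, and sample points are taken to be scalar so that "all the components inside W" reduces to plain membership.

```python
from collections import defaultdict

def quotient_by_intersection(concepts, W):
    """Group concepts having the same intersection with W (the classes of B_W)."""
    W = frozenset(W)
    classes = defaultdict(list)
    for c in concepts:
        classes[frozenset(c) & W].append(frozenset(c))
    return dict(classes)

def restrict_sample(z_m, W):
    """Keep the labeled points of z_m whose input lies in W (the set W_m)."""
    return [(x, b) for x, b in z_m if x in W]

# Hypothetical class on the space {1, ..., 5}, observed through W = {1, 2, 3}.
B = [{1, 4}, {1, 5}, {2, 3}, {2, 3, 4}]
W = {1, 2, 3}
B_W = quotient_by_intersection(B, W)

z_m = [(1, 1), (4, 1), (2, 0), (5, 0)]
W_m = restrict_sample(z_m, W)
```

Seen through W, the four concepts collapse into two classes, represented by {1} and {2, 3}; a strongly surjective algorithm fed with `W_m` is only asked to return one representative per class.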


The above definition fixes some reasonably operational properties of a
learning algorithm. Namely, surjectivity ensures correct hypotheses also when a part
of the sample space is blinded by the sampling mechanism. Hence we have a
reduced freedom for drawing examples but are interested in distinguishing
between the sole intersections of C elements with the reachable part of the sample
space¹⁴. Fairness states a monotonicity relation between symmetric differences
and the set of included sample points. In short, given a sample, a hypothesis
h and its symmetric difference c + h, an h' in the range of the same learning
algorithm such that $c + h \subset c + h'$ must include in c + h' a number of
mislabeled examples greater than that of c + h. This property, which is automatically
satisfied by consistent hypotheses, must be specified for non consistent ones. Finally,
satisfying the consistency constraint on the hypotheses adds the twofold benefit
that: i) their class H coincides with C (apart from the formal descriptions of
their items or a possible set of points having null measure), and ii) fairness is
always guaranteed.

With these caveats, we restate (3.9) in the following theorem.


Theorem 3.1. For a space $\mathfrak{X}$, assume we are given

• a concept class C on $\mathfrak{X}$ with $D_{C,C} = \mu$;

• a set $\mathfrak{Z}_m$ of labeled samples $z_m \in (\mathfrak{X} \times \{0,1\})^m$;

• a fairly strongly surjective function $\mathcal{A}: \mathfrak{Z}_m \mapsto H = C$ computing
consistent hypotheses.

Consider the families of random sets $\{c \in C : z_{m+M} = \{(x_i, c(x_i)),\ i = 1, \ldots, m+M\}\}$
when $z_{m+M}$ spans the samples in $\mathfrak{Z}_m$ and the specifications
of their random suffixes $Z_M$, with $M \to +\infty$, according to any sampling
mechanism. For a given $(z_m, h)$, denote with $\mu_h$ the detail $D_{(C,C)_h}$, with $U_{c+h}$
the random variable given by the probability measure of c + h and by $F_{U_{c+h}}$ its
c.d.f. Then for each $\varepsilon \in (0,1)$

$$I_\varepsilon(1, m) \ge F_{U_{c+h}}(\varepsilon) \ge I_\varepsilon(\mu_h, m - \mu_h + 1) \qquad (3.10)$$

¹⁴ With notation of Definition 3.5, the sentry functions to be considered in the $\sup \#S(c)$
for computing $D_C$ must sentinel $C_Y$ when they are defined on a subset Y of $\mathfrak{X}$.




Fig. 3.8: The nested sequence $B((c + h)^+)$ used to build a twisting argument. $B_i$'s:
elements of B. In the figure, $B_i = c + h \cup S(c + h)$ and $B_{i+1}$ is the next element
of the sequence.


Note that (3.10) holds for each specific $z_m$. Thus its extension to any sample
coming from any observation history, i.e. for $(Z_m, \mathcal{A}(Z_m))$, is the following:

Corollary 3.1. With same notation as in Theorem 3.1,

$$I_\varepsilon(1, m) \ge F_{U_{c+h}}(\varepsilon) \ge I_\varepsilon(\mu, m - \mu + 1) \qquad (3.11)$$

$\mu = D_{C,C}$ being the supremum of $\mu_h$ over h.


This gives the operational meaning of the theory. No matter which concept
class you want to learn from time to time, and no matter which distribution law
of the samples results from the synthesis of their asymptotic sequence, if the
detail (or rather an upper bound on the details of the various classes of
symmetric differences) is $\mu$, then the asymptotic frequencies of the learning error
are gauged as in (3.11). To better fix the rationale of this theorem, let us
reconsider the statement (3.11) directly as follows.

From the statistical viewpoint twisting argument (3.8) is well posed since
the random variable whose specifications $k_\varepsilon$ appear therein fulfills the necessary
condition of being a sufficient statistic. In terms of logic, we can realize that
$D_{C,H}$ is an upper bound to the number of sample points sufficient for witnessing
a possible increase of $U_{c+h}$. In particular, we assume for the moment h to be a
consistent hypothesis, thus having $k_{u_{c+h}} = 0$, and H = C. Then for each c
candidate to explain present and future labeling histories (thus consistent with
both $z_m$ and any of its suffixes $z_M$) we can build a nested sequence $B((c + h)^+) =
B_1 \subset B_2 \subset B_3 \subset \ldots$ of subsets of $\mathfrak{X}$, such that $(c + h)^+ = c + h \cup S(c + h)$
belongs to $B((c + h)^+)$ and the previous element in the sequence is included in
c + h (see Fig. 3.8). This sequence will be used to state the right implication in
(3.8). Indeed, consider the companion sequence $U((c + h)^+)$ of the probability
measures of the sets in $B((c + h)^+)$. For a given sample $z_m$, $B((c + h)^+)$ is
a random sequence since $Z_M$ is random. Thus for a fixed $\varepsilon$, the subset $B_\varepsilon$ of
probability measure $\varepsilon$ in the sequence might include c + h or not, where S(c + h)
is exactly the witness of this inclusion. In greater detail, sample $z_m$ contains the
frontier of c + h; thus, if this part of the sample is included in $B_\varepsilon$ we are sure that
for this c it is c + h the pivot of a sequence to which $B_\varepsilon$ belongs. Indeed with
these points no different h' originating a greater symmetric difference could have
been computed. Hence $c + h \subseteq B_\varepsilon$ as well. The implication chain is completed
as follows:

i) on the right by the fact that $S(c + h) \subseteq B_\varepsilon$ if and only if $k_\varepsilon \ge \#S(c + h)$,
where $k_\varepsilon$ is the number of those from among the sampled points which
fall in $B_\varepsilon$, and

ii) on the left by the fact that the event $u_{c+h} < \varepsilon$ is implied by the event
$u_{(c+h)^+} < \varepsilon$ and the latter by $(c + h)^+ \subseteq B_\varepsilon$.

Namely:

$$(u_{c+h} < \varepsilon) \Leftarrow (u_{(c+h)^+} < \varepsilon) \Leftarrow ((c + h)^+ \subseteq B_\varepsilon)
\Leftarrow (S(c + h) \subseteq B_\varepsilon) \Leftrightarrow (k_\varepsilon \ge \#S(c + h)) \qquad (3.12)$$


This induces an opposite chain on probabilities, as an extension of the
inequalities in (2.16):

$$P(U_{c+h} < \varepsilon) \ge P(K_\varepsilon \ge \#S(c + h)) \ge P(K_\varepsilon \ge D_{C,C}) \qquad (3.13)$$

The threshold 1 in the left part of (3.8) is due to the fact that h is, in turn, a
function $\mathcal{A}$ of a sample specification $z_m$. Since the symmetric difference of c
with the h it produces grows with the set of included sample points and vice
versa¹⁵, then $(u_{c+h} < \varepsilon)$ implies that an $\varepsilon$-enlargement region within any c + h'
containing c + h must include at least one more of the sample points at the basis
of the computation of h¹⁶.

The probability chain is thus completed by the following formula:

$$P(K_\varepsilon \ge 1) \ge P(U_{c+h} < \varepsilon) \ge P(K_\varepsilon \ge D_{C,C}) \qquad (3.14)$$

Finally, as in (2.16), $K_\varepsilon$ follows by definition a Binomial distribution law
of parameters m and $\varepsilon$. Thus (3.14) reads in terms of the incomplete Beta
function:

$$I_\varepsilon(1, m) \ge F_{U_{c+h}}(\varepsilon) \ge I_\varepsilon(D_{C,C}, m - D_{C,C} + 1) \qquad (3.15)$$
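One practical reading of the lower bound is a sample-size rule: requiring $I_\varepsilon(\mu, m - \mu + 1) \ge 1 - \delta$ guarantees the confidence requested in (3.7). The sketch below finds the smallest such m by linear search, using the Binomial tail form of the incomplete Beta function. The function names, the search cap `m_max` and the numeric values are assumptions made for illustration.

```python
from math import comb

def binomial_tail(eps, h, m):
    """P(K >= h) for K ~ Binomial(m, eps), which equals I_eps(h, m - h + 1)."""
    return sum(comb(m, i) * eps ** i * (1 - eps) ** (m - i) for i in range(h, m + 1))

def min_sample_size(eps, delta, mu, m_max=10_000):
    """Smallest m whose lower bound I_eps(mu, m - mu + 1) reaches 1 - delta."""
    for m in range(mu, m_max + 1):
        # The tail probability is non-decreasing in m, so the first hit is minimal.
        if binomial_tail(eps, mu, m) >= 1 - delta:
            return m
    raise ValueError("no sample size up to m_max meets the requested confidence")

m_detail_1 = min_sample_size(0.1, 0.05, 1)
m_detail_2 = min_sample_size(0.1, 0.05, 2)
```

With $\varepsilon = 0.1$, $\delta = 0.05$ and detail 1 the search returns 29, the smallest m with $0.9^m \le 0.05$; a larger detail can only increase the required sample size.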


The main lesson we draw from the above discussion is that when we want to
infer a function we must divide the available examples in two categories, relevant
ones and the mass. As in a professor's lecture some (the former) fix the ideas,
thus binding the difference between concept and hypothesis. The latter are
redundant but if we produce a lot of examples we are confident that a sufficient
number of those belonging to the first category will have been exhibited. The
above equation allows us to state a relation between the two category sizes.

¹⁵ According to the strong surjectivity property of the related learning algorithm.

¹⁶ Indeed, any hypothesis is the output of a computation on a labeled sample: more precisely,
only a subset of the sample will determine h, and we will refer to this subset as the signature of
the output hypothesis. The left implication in (3.8) reads: to enlarge the symmetric difference
measure to $\varepsilon$, a different h needs to be output by the learning algorithm. Hence its signature
must have changed, i.e. at least one point in the sample must have switched its label from 1
to 0, whose binding functionality denotes it as a sentry point.
How we interpret the randomness of the symmetric difference probability
represents the divide between conventional approaches and our own. Coming
back to the initial example of localizing a polluted region, we assume in the
conventional approach that the inhabitants' distribution law is available (at least
in terms of the list from which to randomly pick a sample, say the local telephone
directory) and that a circle c exists delimiting the polluted region. Then we
extract a sample (the monitoring stations) receiving labels from it and must
approximate c with another circle h computed on the basis of the observed
labeled points. To address the computation, we may assume for instance the
consistency constraint of h giving the same labels as c to the sampled points.
Since H is a function of the random sample, it is a random variable itself, having
specifications in $h_0, h_1, \ldots$ as in Fig. 3.9(a), whose probability law derives
from the sample's. In the figure $h_0$ is consistent with the points drawn therein,
while we may assume the other hypotheses consistent with other sets of sampled
points each.

Rather, given a In our perspective we dually consider that a circle h is computed on the basis 
et erie of the observed data (for instance with the same consistency constraint) and 
spite of the we assume that a circle c will exist satisfying the same constraints on both the 


Popa cept actual and future observations. Since we can have plenty of different observation 
randomness tistories we can have a lot of different circles, like co, c1, ... as in Fig. 3.9(b), 
as well. However, the fact that both c and h must give the same labels to 

the actual sample allows us to probabilistically describe the family of possible 

c’s in terms of the mentioned error probability, provided that no exceptional 
demographic phenomenon makes the sites of the new requests on the pollution 

state heterogeneous in respect to the monitoring stations (i.e. provided that 

the sampling mechanism is the same). Thus all concepts must be consistent 

exactly with the points drawn in Fig. 3.9(b). Note that, although a circle is a 

good model for pollution diffusion in an isotropic medium, our approach is less 
demanding: we assume that a circle is suitable simply to explain the labels of the 

future points. This means for instance that even if the medium is non isotropic, 

because a marsh increases the diffusion in a special region, the adequacy of the 

circle remains intact provided no people live in the marsh (i.e. the probability of 

drawing a point in the corresponding region is zero). Also the relations between 

past and future observations denote subtle differences. The a priori inhabitants 
distribution law translates in an a posteriori check of data homogeneity in our 
approach. Though many differences like these can be recovered de facto in 

the classical approach, thus resulting in a questionable philosophical preference, 

our perspective in any case allows a simpler and more profitable handling of the 

learning task. We are interested mainly not in the truth (if there is any) but in
a suitable function (algorithm from a computational perspective) describing the
data. That is why we call this the algorithmic inference approach. Usual results
check a measure of $P(U_{c \div h} < \varepsilon)$ along the samples' history. Here we claim that
for each sample realization $z_m$ we have that $U_{c \div h}$ is a random variable bounded
as in (3.15).

Fig. 3.9: Conventional and algorithmic inferences in computational learning. (a) Conventional
learning framework. $c$: a concept from the class of circles; $h_i$: hypotheses
approximating $c$. (b) The corresponding algorithmic inference framework. $h$: the circle
describing the sample according to a learning algorithm; $c_i$: possible circles describing
the population. Same notations as in Fig. 3.1.


3.2 Computing the hypotheses 


Computing a suitable statistic on the labeled sample is the second face of the 
medal. This may be a difficult task either because it requires hard conceptual 
tools for devising the algorithm, or because its implementation needs consider- 
able computing resources (such as memory space and running time), or both. 
The goal is to have a hypothesis about the function underlying the data struc- 
ture. Since the function we look for is Boolean, a crucial discriminant is whether 
or not we search for a hypothesis consistent with the sample. 


3.2.1 Identifying consistent hypotheses 


Computing a hypothesis consistent with positive and negative examples is gen- 
erally not an easy job. It is an obvious job in elementary cases like these: 


Who computes hypotheses?

Easy computable hypotheses

Example 3.3. If the class $H$ of hypotheses is the set of all segments in $\mathbb{R}$, a
pair of extremes of a consistent hypothesis is given by the minimum and
maximum coordinates of the positive points; another pair is given by the open segment
delimited by the coordinates of the rightmost negative point preceding a positive one
and the leftmost negative point preceded by a positive one. Other consistent
segments may be obtained by exchanging extremes between the two pairs. Of
course any intermediate points are candidates, provided they are algorithmically
determined; for instance the points halving the distance between extreme
positive and negative points on both sides, etc.

If class $H$ is the set of b_rectangles in $\mathbb{R}^2$, we obtain a pair of free edges of a
consistent open rectangle by using the ordinate of the first negative point having
ordinate higher than that of the uppermost positive point, and abscissa lower
than that of the rightmost positive point. We work analogously for the abscissa
of the other edge. Possibly we refine the computation on the points having
coordinates jointly intermediate between the found ordinate and that of the
uppermost positive point, and between the found abscissa and that of the rightmost
positive one.
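The segment case of Example 3.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is a list of (coordinate, label) pairs, and the function names `consistent_segments` and `midpoint_segment` are hypothetical, not taken from the text.

```python
def consistent_segments(sample):
    """Return the two extreme pairs of bounds of the consistent segments.

    sample: iterable of (x, label) pairs, label True for positive points.
    inner: closed segment [min positive, max positive].
    outer: open segment delimited by the nearest negative point on each
           side (None where no negative point exists on that side).
    Raises ValueError if no segment is consistent with the sample.
    """
    positives = sorted(x for x, label in sample if label)
    negatives = sorted(x for x, label in sample if not label)
    if not positives:
        raise ValueError("at least one positive point is needed")
    lo, hi = positives[0], positives[-1]
    if any(lo <= x <= hi for x in negatives):
        raise ValueError("no segment is consistent with the sample")
    left = max((x for x in negatives if x < lo), default=None)
    right = min((x for x in negatives if x > hi), default=None)
    return (lo, hi), (left, right)


def midpoint_segment(sample):
    """Halve the gap between extreme positive and negative points."""
    (lo, hi), (left, right) = consistent_segments(sample)
    a = lo if left is None else (lo + left) / 2
    b = hi if right is None else (hi + right) / 2
    return a, b


sample = [(-2.0, False), (0.0, False), (1.0, True),
          (2.0, True), (3.0, True), (5.0, False)]
inner, outer = consistent_segments(sample)
middle = midpoint_segment(sample)
print(inner, outer, middle)
```

Both functions run in time dominated by the sort, which matches the remark that this case is an easy one.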


But even if H is the set of circles the computation of a consistent hypothesis 
requires some more sophisticated reasoning, involving our deepest ability in 
designing numerical programs. For instance, a possible solution is: 


Somehow harder computable hypotheses

Example 3.4. Let a sample of points in $\mathbb{R}^2$, each equipped with a Boolean label,
be given. Consider the problem of finding, among all the circles consistent with
the sample, the minimal one (i.e. the circle contained in every consistent circle).
Algorithm 3.1 learns such a circle.


Algorithm 3.1 Learning a consistent circle.
1. Find two positive points maximizing their mutual distance. Let these
   points be $x_1$ and $x_2$;
2. Let $h$ be the circle having $x_1$ and $x_2$ as extremes of one of its diameters;
3. while $h$ is not consistent
   (a) find the misclassified point $x_3$ having maximal distance from the
       center of $h$;
   (b) let $h$ be the circle passing through $x_1$, $x_2$ and $x_3$;
   (c) if $h$ is not consistent, swap $x_3$ with either $x_1$ or $x_2$.
4. return $h$.
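A rough Python sketch of the steps of Algorithm 3.1 follows. Several choices here are assumptions rather than part of the algorithm as stated: points on the boundary count as inside (with a small tolerance), the swap in step (c) replaces whichever of the two points is closer to the new one, and an iteration cap guards against non-termination. The names `learn_circle`, `circle_through` and `misclassified` are illustrative.

```python
import math
from itertools import combinations


def circle_from_diameter(p, q):
    center = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return center, math.dist(p, q) / 2


def circle_through(p, q, r):
    """Circumscribed circle of three points; None if they are collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (sa * (by - cy) + sb * (cy - ay) + sc * (ay - by)) / d
    uy = (sa * (cx - bx) + sb * (ax - cx) + sc * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), p)


def misclassified(circle, sample, eps=1e-9):
    center, radius = circle
    return [p for p, label in sample
            if (math.dist(center, p) <= radius + eps) != label]


def learn_circle(sample, max_iter=100):
    """sample: list of ((x, y), label); needs at least two positive points."""
    positives = [p for p, label in sample if label]
    # Step 1: farthest pair of positive points.
    x1, x2 = max(combinations(positives, 2), key=lambda pq: math.dist(*pq))
    h = circle_from_diameter(x1, x2)              # step 2
    for _ in range(max_iter):                     # step 3, with a cap
        wrong = misclassified(h, sample)
        if not wrong:
            return h                              # step 4
        x3 = max(wrong, key=lambda p: math.dist(h[0], p))   # step 3(a)
        candidate = circle_through(x1, x2, x3)              # step 3(b)
        if candidate is None:
            return None
        h = candidate
        if misclassified(h, sample):              # step 3(c), assumed rule
            if math.dist(x1, x3) < math.dist(x2, x3):
                x1 = x3
            else:
                x2 = x3
    return None  # no consistent circle found within the cap


sample = [((0.0, 0.0), True), ((2.0, 0.0), True),
          ((1.0, 1.2), True), ((5.0, 5.0), False)]
circle = learn_circle(sample)
print(circle)
```

The sketch returns `None` when it gives up, so a caller can tell a failed search from a consistent circle.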


We can reasonably claim that the computational difficulty, say the running time,
grows with the number of sentry points. A general algorithm for discovering
a consistent formula within a class $H$ could go in search of the sentinels first
and then draw a hypothesis consistent with them. This happened in our example
with b_rectangles. Rather, as with circles, searching and drawing may
be performed simultaneously. As a matter of fact we have such a large variety
of computational instances that our programming ability is keenly challenged
at each learning instance. Finding $k$ sentinels within a sample of $m$ items may
require exploitation of the full combinatorial scheme of $\binom{m}{k}$ trials, each requiring
a sentineling functionality check of any complexity. Drawing a hypothesis consistent
with the sentinels may be easy, as for rectangles or the class of Support
Vector Machines that we will see in a later example; or it may be very hard,
as with the widely spread classes of Boolean functions that we will discuss in a
moment.

To get an overall picture of the computational load in finding consistent 
hypotheses, let us come back to discrete variables, our actual operational frame. 
The decision version of the task is the following CONSISTENCY problem. With
$n$ being the number of bits in input to our formulas, $m(n)$ a polynomial in $n$ and
$l \leq m(n)$, it reads:

Problem 1. CONSISTENCY($H_n$, $l$, $b$)

Instance: a labeled sample $z_m$ on $\mathfrak{X} = \{0,1\}^n$; $v = n \times m$.

Question: is there a function $h \in H_n$ that can be described in $m(n)$ bits,
is consistent with $z_m$, and has the given binary string $b$ as the $l$-long prefix
of its description?


As is well known, a conventional divide between soft and hard computational
problems is the function expressing the dependence of the running time on the 
length v of the problem input. If this function is a polynomial in v, whatever the 
assignments to the v bits will be, then we assume the running time bearable and 
the computational problem soft. In particular, such decision problems (whose 
solution is yes or no) are said to belong to the class P of the problems that can 
be solved in polynomial time (see Appendix D). 

Now, on the one hand it is evident that if CONSISTENCY($H_n$, $l$, $b$) is in
P, then it is possible to find an algorithm that outputs a consistent hypothesis
(if any) in a time polynomial in $v$. Simply, we run $m(n)$ CONSISTENCY problems
according to Algorithm 3.2, where, by abuse of notation, we denote with $a + \{x\}$ the
operation of appending the character $x$ to the string $a$. On the other hand, the
following fact holds:


Fact 3.1. According to the widest conjecture, CONSISTENCY($H_n$, $l$, $b$) does not
belong to P.


Actually, there is no formal proof of this sentence. But in line with what
happens with one way functions in random number generation, we nevertheless
accept it as true since in thirty years or so no computer program has discovered
its falsity; and we assume this as a meaningful sample of what will hold in the
future. Hence, also our algorithm cannot run in polynomial time. Rather we
can assert that within the same conjecture our target formula cannot be learnt
through $H_n$ in polynomial time.

In general computing consistent hypotheses may prove very hard when not unbearable

CONSISTENCY is a typical NP-hard problem at the basis the search problem for h.

Hence no magic stick is guaranteed for learning

For a Boolean formula it is easy to compute a concise representation through a short
Conjunctive Normal Form (if it exists),

it may prove impossible through a short Disjunctive Normal Form.

Algorithm 3.2 Computing consistent hypotheses bit by bit.
b = ∅;
For i = 1 to m(n)
    If CONSISTENCY(H_n, i, b + {1})
        Then b = b + {1};
        Else b = b + {0};
    End If
End For;
return b;
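The bit-by-bit search of Algorithm 3.2 can be tried on a toy class. In the sketch below the hypothesis class (monotone conjunctions described by an n-bit mask) and the brute-force `consistency` oracle are assumptions made only to have something runnable: the oracle is exponential and merely stands in for the hypothetical polynomial decision procedure. Unlike Algorithm 3.2, the sketch first asks whether any consistent description exists at all.

```python
from itertools import product


def predict(mask, x):
    """Monotone conjunction: true iff every selected variable is 1 in x."""
    return all(xi for bit, xi in zip(mask, x) if bit)


def consistency(sample, n_bits, prefix):
    """Decision oracle: is some description starting with `prefix`
    consistent with the sample? Brute force, for illustration only."""
    for tail in product((0, 1), repeat=n_bits - len(prefix)):
        mask = tuple(prefix) + tail
        if all(predict(mask, x) == label for x, label in sample):
            return True
    return False


def find_consistent(sample, n_bits):
    """Fix the description one bit at a time through n_bits oracle calls."""
    if not consistency(sample, n_bits, ()):
        return None  # no consistent hypothesis in the class
    b = []
    for _ in range(n_bits):
        if consistency(sample, n_bits, b + [1]):
            b.append(1)
        else:
            b.append(0)  # safe: some extension of b is known to exist
    return tuple(b)


# Labels generated by the conjunction x0 AND x2 over three variables.
sample = [((1, 1, 1), True), ((1, 0, 1), True),
          ((0, 1, 1), False), ((1, 1, 0), False)]
found = find_consistent(sample, 3)
contradictory = [((1, 1, 1), True), ((1, 1, 1), False)]
print(found, find_consistent(contradictory, 3))
```

The search routine itself makes only a linear number of oracle calls, which is the point of the reduction: the whole cost sits in the decision problem.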

A class of concepts where it is easy to find a consistent one is constituted by 
k-CNF’s. 


Example 3.5. Let $\mathfrak{X}$ be the space $X_n = \{0,1\}^n$ of the Boolean vectors $x$ of
size $n$, which can be assigned to the set $\boldsymbol{v}_n = \{v_1, \ldots, v_n\}$ of propositional
variables in argument to Boolean formulas, a literal $\ell$ being an affirmed $v$ or
a negated $\bar{v}$ propositional variable. A formula is satisfied by an assignment to
the propositional variables if the result of the formula is 1. A clause is the
disjunction $(\ell_i \vee \ell_j \vee \cdots \vee \ell_k)$ of literals and a Conjunctive Normal Form, or
CNF for short, is the conjunction of a set of clauses. A k-CNF is a CNF whose
clauses, in any number, are the disjunction of at most $k$ literals. The list pruning
algorithm proposed by Valiant to learn a k-CNF on the basis of sole positive
examples is outlined in Algorithm 3.3 [Valiant, 1984].

Note that the formula we obtain at the end of the run is the minimal k-CNF
$h$ consistent with the examples, in the sense that no k-CNF exists whose
support is strictly included in that of $h$ and which is still consistent with the examples.
Moreover, the number of possible clauses is less than $\sum_{i=0}^{k} \binom{2n}{i} < (2n)^{k+1}$. Thus
our task is polynomial if applied to a polynomial number of examples.
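A minimal Python sketch of list pruning may help fix ideas. The representation is an assumption made for the example: a literal is a pair (variable index, sign), a clause is a frozenset of literals, and the hypothesis is the set of surviving clauses read as their conjunction. Tautological clauses are kept for simplicity, since they are never erased and never affect the result.

```python
from itertools import combinations


def all_clauses(n, k):
    """All clauses with at most k literals over n variables.
    A literal is (index, sign): sign True for v_i, False for its negation."""
    literals = [(i, s) for i in range(n) for s in (True, False)]
    return {frozenset(combo)
            for size in range(1, k + 1)
            for combo in combinations(literals, size)}


def satisfies(clause, x):
    """A clause is satisfied when at least one of its literals is true."""
    return any(bool(x[i]) == sign for i, sign in clause)


def list_pruning(positive_examples, n, k):
    """Erase, example by example, every clause the example falsifies."""
    clauses = all_clauses(n, k)
    for x in positive_examples:
        clauses = {c for c in clauses if satisfies(c, x)}
    return clauses


def evaluate(cnf, x):
    return all(satisfies(c, x) for c in cnf)


# Positive examples of the target v0 AND (v1 OR v2), with n = 3 and k = 2.
positives = [(1, 1, 0), (1, 0, 1), (1, 1, 1)]
h = list_pruning(positives, n=3, k=2)
print(len(all_clauses(3, 2)), len(h))
```

Each example costs one pass over the clause list, so the running time is the number of examples times a quantity bounded by $(2n)^{k+1}$.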


On the other hand, it is possible to prove that: 


Fact 3.2. [Pitt and Valiant, 1988] Denoting by monomial the intersection
$(\ell_i \wedge \ell_j \wedge \ldots \wedge \ell_k)$ of literals, by k-term DNF (Disjunctive Normal Form) the union of
at most $k$ monomials, and by k-term CNF the intersection of at most $k$ clauses, no
matter the number of literals each monomial or clause is composed of, there is no
polynomial algorithm for computing a k-term DNF or a k-term CNF consistent
with a polynomial number of examples whenever $k \geq 2$.


We should note, however, that the time complexity of learning a given concept
class might strongly depend on the functional representation of the hypothesis.
If we focus on Boolean formulas, finding within the class of k-term
DNF formulas elements approximating another one appears unfeasible in light
of Fact 3.2. On the contrary, if we decide to use a less concise description of
these formulas, for instance representing them through k-CNF formulas, we can
find within this class a hypothesis on a k-term DNF in polynomial time. We
will speak of proper learning when concept and hypothesis classes coincide, of
non proper learning otherwise.

Algorithm 3.3 List pruning.
• Start with the conjunction of all possible clauses with at most $k$ affirmed or
  negated propositional variables;
• On each example $x$, erase from the formula all clauses that are not satisfied
  by $x$.


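Example 3.6. Support Vector Machines (SVM for short)
[Cortes and Vapnik, 1995] represent a widely used tool to compute the hypothesis
in the class $H$ of hyperplanes in $\mathbb{R}^n$, for a given $n \in \mathbb{N}$. Given a sample
$\{x_1, \ldots, x_m\}$ whose associated labels are $\{b_1, \ldots, b_m\}$, this tool first computes
the argument $\{\alpha_1^*, \ldots, \alpha_m^*\}$ of the solution of the dual constrained optimization
problem (where $\cdot$ denotes the standard dot product in $\mathbb{R}^n$):

\[ \max_{\alpha_1, \ldots, \alpha_m} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j b_i b_j \, x_i \cdot x_j \tag{3.16} \]

\[ \sum_{i=1}^{m} \alpha_i b_i = 0 \tag{3.17} \]

\[ \alpha_i \geq 0, \quad i = 1, \ldots, m \tag{3.18} \]

and then returns a hyperplane whose equation is $w \cdot x + b^* = 0$, where

\[ w = \sum_{i=1}^{m} \alpha_i^* b_i x_i \tag{3.19} \]

\[ b^* = b_i - w \cdot x_i \quad \text{for } i \text{ such that } \alpha_i^* > 0 \tag{3.20} \]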


In the case the sample can be discriminated using a hyperplane (separable sample),
this algorithm is guaranteed to produce a consistent hypothesis with optimal
margin, i.e. a hyperplane maximizing its minimal distance from the sample
points. Moreover, typically only a few of the components of $\{\alpha_1^*, \ldots, \alpha_m^*\}$ are
different from zero, so that the hypothesis depends on a small subset of the
available examples (whose elements are called support vectors).
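To see the dual problem at work on the smallest possible case, the following Python sketch evaluates the objective (3.16) and recovers the hyperplane from a dual solution. It is a toy under stated assumptions: labels are coded as +1 and -1, the two-point sample lets the equality constraint reduce the search to a single value shared by both multipliers, and a coarse grid search replaces a real quadratic programming solver. The function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, labels, points, kernel=dot):
    """Value of the dual objective (3.16) for given multipliers."""
    m = len(points)
    quad = sum(alphas[i] * alphas[j] * labels[i] * labels[j]
               * kernel(points[i], points[j])
               for i in range(m) for j in range(m))
    return sum(alphas) - 0.5 * quad


def hyperplane_from_dual(alphas, labels, points, tol=1e-12):
    """Weight vector as the label-weighted combination of the points,
    offset computed on any point with a positive multiplier."""
    dim = len(points[0])
    w = [sum(a * b * x[d] for a, b, x in zip(alphas, labels, points))
         for d in range(dim)]
    i = next(i for i, a in enumerate(alphas) if a > tol)
    offset = labels[i] - dot(w, points[i])
    return w, offset


# Two points on the horizontal axis with opposite labels (+1 / -1).
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
# The equality constraint forces both multipliers to share one value a.
grid = [k / 100 for k in range(0, 201)]
best = max(grid, key=lambda a: dual_objective([a, a], labels, points))
w, offset = hyperplane_from_dual([best, best], labels, points)
print(best, w, offset)
```

On this sample both points are support vectors and the recovered hyperplane is the vertical axis, halfway between them.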

Also a consistent hyperplane is easy to compute with a SVM

maximizing the margin between it and neighboring points.

The same if the hyperplane is in a suitably devised dummy space

Simplicity of a formal description of sampled points may come at a cost in term of its accuracy

We can manage it quantitatively

Support vector machines can be trained to learn nonlinear separating surfaces.
The general idea is to map sample points into a higher dimensional feature
space where they become separable through a hyperplane. As sample points appear
throughout (3.16-3.20) only as arguments of dot products, it is possible to
maintain the general structure of the algorithm, substituting in it each occurrence
$x_j \cdot x_k$ of this product with the computation of a kernel $K(x_j, x_k)$ which
accounts for both mapping the points in the feature space and computing the
dot product therein.
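The substitution of dot products with a kernel can be checked numerically on a standard textbook case. The homogeneous quadratic kernel and its explicit feature map below are chosen here only as an illustration; they are not taken from the text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(u, v):
    """K(u, v) = (u . v)^2, computed in the original two-dimensional space."""
    return dot(u, v) ** 2


def feature_map(u):
    """Explicit map into a three-dimensional feature space whose dot
    product reproduces the quadratic kernel."""
    x1, x2 = u
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


u, v = (1.0, 2.0), (3.0, -1.0)
in_input_space = quadratic_kernel(u, v)
in_feature_space = dot(feature_map(u), feature_map(v))
print(in_input_space, in_feature_space)
```

The two values agree up to rounding, so an algorithm that touches the points only through dot products never needs to build the feature vectors explicitly.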


3.2.2 Relaxing the consistency constraints. The structural risk 
minimization 
The consistency requirement on $\mathcal{A}$ is exactly the condition allowing a twisting
argument statement. However, identifying $C$ with the class of circles may prove
a very simplistic way of modeling the diffusion of a polluting agent in a territory.
The underlying homogeneity hypothesis of the diffusing medium may hold only
roughly. A more realistic class could be the set of ellipses or perhaps curves
with more degrees of freedom and, what is relevant for our inference, greater
detail. On the other hand $C$ is so relatively easy to manage that we wonder what
damage may come from this simplification. As for the examples, we may expect
that we will not be able to find a circle that correctly divides positive from negative
examples. This means we could find in the symmetric difference some points
that make $k_{u_{c \div h}}$ greater than 0 in (3.8). In case of a fair learning algorithm
we extend (3.9) by adding to $D_C$ an upper bound $t$ to $k_{u_{c \div h}}$ in the right
implication of (3.8) and adding to 1 a lower bound $t''$ to this quantity in the
left implication. In this case however $U_{c \div h}$ cannot be less than the probability
mass of the mislabeled points. Note also that we now have the hypothesis class
$H$ different from the concept class $C$, with possibly $D_{C,H} < D_{C,C}$.
With these classes of hypotheses (3.10) reads:


Theorem 3.2. Assume the hypotheses and notations of Theorem 3.1, allowing
now $\mathcal{A} : \{z_m\} \mapsto H$ to misclassify points. More precisely, for $h = \mathcal{A}(z_m)$ let
$t_h$ and $\rho_h$ be the number of these points and their total probability. Then for a
given $(z_m, h)$ and each $\beta \in (\rho_h, 1)$

\[ I_\beta(1 + t_h, m - t_h) \geq F_{U_{c \div h}}(\beta) \geq I_\beta(\mu_h + t_h, m - (\mu_h + t_h) + 1) \tag{3.21} \]

To give again an operational meaning to the above equation we consider a
long sequence of samples and bounds on both $t_h$ and $\rho_h$, and state:

Lemma 3.2. In addition to the notation in Theorem 3.2, denoting with $t$ and $t''$
an upper and lower bound to $t_h$, respectively, guaranteed by $\mathcal{A}$, and with $\rho$ an
analogous upper bound on $\rho_h$, for each $(z_m, h)$ and $\beta \in (\rho, 1)$

\[ I_\beta(1 + t'', m - t'') \geq F_{U_{c \div h}}(\beta) \geq I_\beta(\mu + t, m - (\mu + t) + 1) \tag{3.22} \]

Remark 3.2. Note that terms $\rho_h$ and $\rho$ generally vanish, as we are working in a
distribution-free framework so that no prevention exists against the possibility
that $X$ follows a continuous distribution law.
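The two-sided bounds of Lemma 3.2 are easy to evaluate numerically. The sketch below assumes integer arguments, for which the regularized incomplete beta function equals a binomial tail probability, a standard identity; the function and parameter names (`mu` for the number of sentry points, `t` and `t_low` for the upper and lower bounds on mislabeled points) are illustrative.

```python
from math import comb


def reg_inc_beta(x, a, b):
    """Regularized incomplete beta I_x(a, b) for positive integers a, b,
    through the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(a, n + 1))


def error_cdf_bounds(beta, m, mu, t, t_low=0):
    """Lower and upper bounds on the distribution function of the error
    probability at beta, for sample size m.
    Requires mu + t <= m and t_low < m."""
    upper = reg_inc_beta(beta, 1 + t_low, m - t_low)
    lower = reg_inc_beta(beta, mu + t, m - (mu + t) + 1)
    return lower, upper


lower, upper = error_cdf_bounds(beta=0.1, m=100, mu=3, t=2, t_low=1)
lower_more_sentries, _ = error_cdf_bounds(beta=0.1, m=100, mu=10, t=2, t_low=1)
print(lower, upper, lower_more_sentries)
```

Raising either the number of sentry points or the number of tolerated mistakes lowers the lower bound, since the two quantities enter it only through their sum.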




Equations 3.21 and 3.22 give a definite and simple solution to the problem raised
by Vapnik in connection with the enunciation of the Structural Risk Minimization
principle [Vapnik, 1995]. In short, the problem reads as follows: how much more
efficient is it to learn relations between data with a correct yet very complex
formula than with an approximate but simpler one? The problem boils down to
the parameters $\mu_h$ and $t_h$ (high $\mu_h$ and low $t_h$ in the first alternative,
and vice versa in the latter). Equation 3.21 says that these parameters simply
add up to fix the distribution law of the probability of computing a wrong
output in the future.

Note that $\mu_h$ may be very high, even greater than the sample size itself, as
occurs in the last point of Example 3.2. In many operational cases it grows with
the number of variables considered in the learning problem. This is the case
where we want to learn a b_rectangle in an n-dimensional space, having $\mu_h = n$,
and still worse in case we want to learn a k-CNF, since $\mu_h$ is a polynomial in the
number of propositional variables. Furthermore, if we do not state any bound
on the number of variables in a single clause, so that $k = n$, then the above
parameter becomes exponential in $n$. When the number of involved variables
grows, $\mu_h$ generally increases, and this decreases the lower bound on $F_{U_{c \div h}}(\beta)$
in (3.21) and (3.22). It is a drawback that Bellman denotes as the curse of
dimensionality [Bellman, 1961].

The way of reducing the detail of a class passes through the injection of 
tighter constraints on its concepts. Typically the class C made of concepts 
each described by a union of at most 5 intervals on the real line has fewer 
constraints (and more descriptional power as a reward) than the class C’ where 
the concepts are constituted by unions of at most 4 intervals. This reduction 
may descend from additional information we have about C. Thus we receive a 
net benefit from having studied the learning task in greater depth. Where such 
deeper study is lacking (or impossible), we might decide to simplify the class at
a (possibly) controlled cost in terms of consistency loss, i.e. accepting that the
hypothesis voluntarily mislabels some examples. Within the classical inference
framework this approximation management problem reads as a bias-variance
problem [Mitchell, 1997], an elementary instance of which we met when we
searched for a minimal mean square error (MSE) estimator of the parameter
$\sigma^2$ of a Gaussian variable (see Example 2.27). Following Definition 2.12 an
unbiased estimator exactly minimizes the MSE computed over the parameter
distribution law, and, after (3.21), the terms to balance (sentry and mislabeled
points) stay in the sample.

In addition, we do not care why the points are mislabeled, provided that
the same cause remains in the future with the same effects. This means that
the same inequalities on $F_{U_{c \div h}}(\beta)$ hold also if the wrong labels come originally
with the sample. Such is the case where the example supplier (metaphorically a
teacher) makes ingenuous errors that we can model with a coin toss: for each
example the teacher tosses a coin; if it comes up heads he assigns the example
the correct label; if it comes up tails he inverts the label. We virtually
extend this mechanism to the future population as a source of the errors included
in Ø. Things are different in the case of malicious errors where a dishonest
teacher deceitfully decides when to give a wrong label to the examples. In this
case we have no way of transferring the error producer to the future population;
rather we might reason in terms of worst cases in very specific scenarios beyond
our present scope.

Detail and accuracy simply add for parametrizing the error distribution law

As many variables you consider in the learning problem, so much the detail might increase

The antidote is to better study the problem (injection of formal knowledge)

The bias-variance problem

Voluntary or involuntary errors

An incremental bias-variance negotiation with a dovetail issue of PAC.

For SVM negotiation is centered around a single parameter $\eta$

Balancing sentry and mislabeled points cannot be done a priori in general 
since both are statistics of the sample in hand. Thus we may consider the 
outcome of dynamical procedures, like the following ones: 


Definition 3.6. Given a concept class $C$ on $\mathfrak{X}$ and a nested family of classes of
hypotheses $H_i$ such that $H_i \subset H_{i+1}$, an H_large_as_needs (Hln) learning procedure
[Apolloni and Malchiodi, 2001] $\mathcal{A}$ is defined by the steps described in Algorithm
3.4 for a given $z_m$.

Algorithm 3.4 H_large_as_needs procedure.
1. Start with $k = 0$ and $H = H_0$;
2. For $i = 0$ to $k$
3.     Search for an almost consistent hypothesis $h \in H$;
4.     If the number $\nu$ of points $x$ such that $c(x) \neq h(x)$ equals $i$, then Stop;
5. Set $k = k + 1$, $H = H_k$ and go to step 2.


The efficiency of the dovetailing algorithm relies on a suitable indexing of $H$.
It is obvious that the detail of $H_i$ grows with $i$. Thus, starting with the most elementary
class $H_0$, we are drawn to increment its complexity if the number of label
mistakes is too high. In turn, the more the complexity increases the more we
are pushed to be tolerant on this number. After fixing thresholds $t$ and
$t''$ to this number we can rely on the above relations (3.21) and (3.22).


Definition 3.7. A variant of the SVM algorithm, known as soft-margin classifier
[Cortes and Vapnik, 1995], produces hypotheses for which the consistency requirement
is relaxed, introducing a parameter $\eta$, whose value quantifies the
degree of this relaxation. The corresponding optimization problem is essentially
unchanged, with the sole exception of (3.18), which now becomes

\[ 0 \leq \alpha_i \leq \eta, \quad i = 1, \ldots, m \tag{3.18'} \]


3.3 Further biases 


3.3.1 Distribution dependent complexity indices 


Note that results enunciated till now are distribution-free, in the sense they do
not assume any bias on the future sequences distribution law. However in some
cases the operational framework may be so well-known that some constraints
may be put on this distribution law. For instance a mail server might filter only
messages below a certain length. This means that this variable is bounded and
those that relate to it have some bias, too. Now, the upper bound in (3.21)
is untouchable, possibly becoming useless for some discrete distribution law on $\mathfrak{X}$.
For such a distribution law, enlarging $c \div h$ to reach a measure $\varepsilon$ you could jump
directly to a probability greater than it. Thus the sole chance of $c \div h$ satisfying
$U_{c \div h} < \varepsilon$ is to remain exactly as is, with a consequent probability inequality

\[ I_\varepsilon(k_{u_{c \div h}}, m - k_{u_{c \div h}} + 1) \geq I_{\varepsilon'}(k_{u_{c \div h}}, m - k_{u_{c \div h}} + 1) \geq F_{U_{c \div h}}(\varepsilon) \tag{3.23} \]

with $0 < \varepsilon' < \varepsilon$, which loses meaning for a consistent hypothesis, i.e. for $k_{u_{c \div h}} = 0$,
as $I_\alpha(0, m) = 1$ for any $\alpha$. In these cases a stair-shaped distribution law
should arise from a more detailed knowledge of the distribution biases and their
relation to the concept class.

Skipping these cases, we may assume the lower bound as a target we reach
from above in the right member of (3.21) by saving some sentry points. We
may do it, for instance, when we exploit some symmetries in the $\mathfrak{X}$ distribution
law. Consider the task of learning b_rectangles as in Sec. 3.1.2. Since they are
convex polygons, we will reason directly on them in place of their symmetric
differences. We realized in Fig. 3.3(a) that 4 points are needed to sentinel the expansion
of a four dimensional rectangle. But to check this numerically we needed to
stretch the coordinate distribution laws far from the uniform distribution.
Indeed if points have uniform coordinates we see that the sample-population k's
trajectory falls below the dotted curve almost always. This means that only one
sentry point is necessary. We find an explanation for this fact in the complete
symmetry of the four-dimensional uniform distribution. In short: an error
in localizing one edge does not drastically unbalance the error in the product
space.


3.3.2 Sentry points vs. support vectors 


Support vectors for a hyperplane have a stringent role: they determine exactly
the hyperplane through the optimal margin learning algorithm (3.16-3.20).
Hence the detail of an SVM does not exceed the support vector cardinality.

We state it formally as follows: 


Fact 3.3. Let us denote with $C$ the concept class of hyperplanes on a given space
$\mathfrak{X}$ of dimension $d$ and by $\sigma = \{x_1, \ldots, x_{\mu_h}\}$ a minimal set of support vectors of
an optimal margin hyperplane $h$ (i.e. $\sigma$ is a support vector set but, whatever $i$ is,
no $\sigma - \{x_i\}$ does the same). Then, for whatever goal hyperplane $c$ separating the
above set consistently with $h$, there exists a sentry function $S$ on $C \div H$ assigning
to $c \div h$ a sentry set made up of a subset of $\sigma$ of cardinality at most $\min\{d, \mu_h\}$.


Proof. It comes from the fact that binding a concept is less demanding than
identifying it. As mentioned in Example 3.2, sentineling the expansion of the
symmetric difference $c \div h$ results in forbidding any rotation of $h$ into an $h'$ pivoted
along the intersection of $c$ with $h$. The membership of this intersection to
$h'$ leaves one degree of freedom for known $c$, hence the need for one sentry point.
But this point may unluckily prove useless if it falls exactly on the rotation axis,
which calls for a maximum of $d$ sentry points, where $d$ is the dimensionality of
the space X. The question is whether sentineling may enjoy the same
benefits that support vectors exploit to reduce their cardinality. These benefits come
from orthogonality relationships between support vectors and maximum margin
hyperplane. The answer is yes. According to the sentry points pathology
described in Example 3.2, we may divide these points into $\nu$ points falling in the
intersection between $h$ and $c$, that is to say fixing the rotation axis, and one
binding the rotation. The above axis is constrained by the same relations used
to determine the optimal margin hyperplane, since the rotation is virtual (indeed,
we aim at forbidding it). Moreover, since at this moment we are interested not in
the margin optimization problem but only in binding the symmetric
difference, we do not need extra points for locating the hyperplane within the
examples, which means that we will never need more than $d - 1$ sentry points for
fixing the rotation axis.

No saving on the upper bound

Possible sentry point discounts on the lower bound

No more sentries that support vectors

Fewer points than the entire frontier set commonly needs for sentineling

The less you are accurate in your computations, the more sentry points you need

SVM is a respectful learning algorithm whose accuracy may be controlled through the
number of support vectors used.


Moreover, from the third item of Example 3.2 we learn that the necessity
of extra points for sentineling $c \div h$ comes from the fact that sample points fall
exactly in the intersection between the two hyperplanes. Since no way exists
for having the probability of this event different from 0 if both the sample space
and its probability distribution are continuous, these linear relations actually
occur only if either the sample space is discrete or the algorithm computing the
hyperplane is so approximate as to work on an actually discretized search space.

Thus we may conclude: 


Fact 3.4. The number of sentry points of separating hyperplanes computed 
through support vector machines ranges from 1 to the minimal number of in- 
volved support vectors minus one, depending on the approximation with which 
either sample coordinates are stored or hyperplanes are computed. 


The fairly strong surjectivity requirement is not a heavy limitation on learning
algorithms. It essentially constitutes an algorithmic counterpart to the
class well behavior requirement introduced in [Blumer et al., 1989] at the basis
of the classical results. The soft-margin algorithm satisfies the well behavior
requirement as well. On the one hand our learning procedure relies on a set of
points that sentinel the expansion of the symmetric difference between concept 
and hypothesis. On the other, the SVM way of computing hypotheses univo- 
cally links them to special sets of points, the support vectors, within a training 
sample. Fact 3.3 claims that we find the former set inside any of the latter sets. 
Therefore an operational corollary of Theorem 3.2 emerges, which we will use 
in depth in the next section. It is the following: 


Corollary 3.2. For

• a space $X$ and any probability measure $P$;
• a concept class $C$ consisting of hyperplanes;
• a soft-margin algorithm $\mathcal{A} : \{z_m\} \mapsto C$ computing hypotheses within the
  same class on the basis of labeled samples $z_m$;
• a hypothesis $h = \mathcal{A}(z_m)$ using at most $k_h$ minimal support vectors, misclassifying
  at most $t_h$ and at least $t'_h$ points of cumulative probability $\rho_h$;
• any $\beta \in (\rho_h, 1)$,

with $U_{c \div h}$ and $F_{U_{c \div h}}$ denoted as in Theorem 3.1,

\[ I_\beta(1 + t'_h, m - t'_h) \geq F_{U_{c \div h}}(\beta) \geq I_\beta(k_h + t_h, m - (k_h + t_h) + 1) \tag{3.24} \]

Let us denote by $k$, $t$ and $\rho$ the maximum of $k_h$, $t_h$ and $\rho_h$ over $h$, respectively;
then for each $(z_m, h)$ and $\beta \in (\rho, 1)$

\[ I_\beta(1, m) \geq F_{U_{c \div h}}(\beta) \geq I_\beta(k + t, m - (k + t) + 1) \tag{3.25} \]


3.3.3 On-line learning 


Since computing a hypothesis is a complex procedure generally on a long se- 
quence of examples, it is a sagacious strategy to configure the computation as 
an on-line procedure. This means that it evolves through an iterated improve- 
ment of a current hypothesis based on the current example made available by 
the sequence. In the learning algorithms discussed up to now we may identify 
two phases: inconsistency removal that passes through the identification of the 
sentry points of the ongoing hypothesis; and selection of one of the consistent 
hypotheses on the basis of further points, which joined to the former constitute 
the signature of the computed hypothesis. Thus the on-line implementation 
of these phases requires a quick updating of these two sets of points on the 
basis of the new example and a few ancillary data. This is a general attitude 
of the greedy algorithms that not all learning procedures may have. For in- 
stance, learning the set of roads involved in the minimum length trip between 
a certain number of cities in a road grid, that constitutes a part of the solution 
of the traveling salesperson problem [Nemhauser et al., 1989], cannot be done 
incrementally over a certain approximation [Hochbaum, 1997]. In particular, 


Example 3.7. The procedure in Algorithm 3.3 updates the sentry points set at
each arrival of an inconsistent point. As mentioned in the last point of Example
3.1, each monotone clause is sentineled by just one point, and it is easy to
see that $\nu$ points are needed to sentinel a CNF made of $\nu$ clauses. In this case the
signature of the final formula coincides with the sentry points' set. Thus the
latter set is learnt as well, as a by-product of the procedure.


At the incoming of the new example we update the hypothesis on the sole basis of its
previous version and a few ancillary data

A simple instance

A difficult one

Ancillary information may help

Fig. 3.10: Binding the symmetric difference among b_rectangles in case of: (a) concepts
contained in the hypothesis, and (b-c) hypothesis and concepts partially overlapping.
$c$: target rectangle; $h_i$'s: possible hypotheses.

Learning a convex polygon may not be manageable on-line. Indeed:


Example 3.8. Consider the task of learning a b_rectangle in $\mathbb{R}^2$ as in Sec. 3.1.2.
In this case you need at most two sentry points to sentinel the symmetric
difference of the current hypothesis with a target rectangle: two in the case that
the latter includes the former (see Fig. 3.10(a)) and one (not in the intersection
of the two concepts) in the case that they overlap only partially (see Figs. 3.10(b)
and (c)). Here too the on-line mode is trivial: we compare the coordinates of the
incoming positive example with those of the current sentinels to decide whether it
replaces one of them or is useless. In a similar way we may update the signature.
For instance, if the concept is an intermediate rectangle between the minimum one
consistent with the positive points and the maximum one consistent with the negative
points, we will use two points for sentineling (say the rightmost and the uppermost
positive) and the analogous negative extreme points for completing the signature.
Things are different if the rectangle edges are not obliged to be parallel to the
coordinate axes (a definitely harder task [Blum and Rivest, 1992]). In this case, at
each new example we must reconsider the whole current set of examples to compute
the new rectangle orientation and sentry points.
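The axis-parallel case can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the hypothesis is taken to be the minimal axis-parallel rectangle enclosing the positive examples, and the names `Rect` and `update` are hypothetical, not taken from the text. The ancillary data reduce to the four extreme coordinates, so each update costs constant time.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, p):
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def update(h, point, label):
    """On-line step: previous hypothesis + new example -> new hypothesis."""
    if not label:
        return h                      # negatives never enlarge the minimal rectangle
    x, y = point
    if h is None:
        return Rect(x, x, y, y)       # first positive example
    if h.contains(point):
        return h                      # the example is useless for the update
    # the example replaces one or more of the extreme coordinates
    return Rect(min(h.x_min, x), max(h.x_max, x), min(h.y_min, y), max(h.y_max, y))

stream = [((0.2, 0.3), True), ((0.9, 0.9), False), ((0.5, 0.1), True), ((0.4, 0.6), True)]
h = None
for point, label in stream:
    h = update(h, point, label)
```

No past example is revisited here, which is exactly what fails when the rectangle orientation is free.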


If available, we may exploit ancillary information to improve the efficiency of
this process. It might just relieve the drawback of not relying on the entire
set of examples, or constitute genuinely extra information. As for the former
category, we may know in advance the whole set of available examples, except
for their labels, which are supplied at run time, and decide the order in which
to process the examples so as to save computing time. This is possible if we
have a notion of distance of the sampled points from the contour of the concept
to be sentineled. A trivial example is the following:


Example 3.9. [Ben-David et al., 1997] For $\mathfrak{X}$ being the unit interval [0, 1],
and 0 < a < 1, let us consider the class C whose generic item is described by




Algorithm 3.5 List augmenting.
• Start with h equal to the empty set ∅;
• on each example x not satisfying h, build a monomial d joining each $v_i$ such
  that $x_i = 1$;
• erase from d all literals allowed by the Oracle, i.e. such that the remaining
  monomial d' is still included in the target DNF;
• add the reduced monomial d' to h.


the threshold function $c_a$:

$$c_a(x) = \begin{cases} 1 & \text{if } x > a \\ 0 & \text{otherwise} \end{cases} \qquad (3.26)$$

For short, C is the class of intervals (a, 1]. It is obvious that for any sample
set $\{x_1, \ldots, x_m\}$, in case positive examples are fewer than negative examples,
the best processing order is the decreasing one $(x_{(m)}, \ldots, x_{(1)})$: at each time i
the current hypothesis coincides with the segment $(x_{(i)}, 1]$ having the last
processed item as left extreme, until the supplied label shifts from 1 to 0.
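A minimal sketch of this ordering strategy follows. The function name `learn_threshold` and the labeling callback `label_of` are illustrative assumptions; labels are requested one at a time, and the scan stops at the first 0.

```python
def learn_threshold(points, label_of):
    """Scan the points in decreasing order, asking labels until the first 0.

    Returns the left extreme of the current hypothesis and the number of
    labels actually requested.
    """
    left = 1.0            # assumption: before any positive point the segment is empty
    queried = 0
    for x in sorted(points, reverse=True):
        queried += 1
        if label_of(x) == 0:
            break         # label shifted from 1 to 0: the smaller points are all negative
        left = x          # last processed positive item becomes the left extreme
    return left, queried

a = 0.7                                   # hidden threshold, used only to produce labels
sample = [0.1, 0.95, 0.4, 0.8, 0.6, 0.3]
left, queried = learn_threshold(sample, lambda x: 1 if x > a else 0)
```

With few positive points the scan touches only the positives plus one negative, whereas an increasing scan would have to cross all the negatives first.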


We could push this facility further by asking that the learner decide on his own
the next example to be fed [Goldman and Sloan, 1994]. This is a typical attitude of
the active learning algorithms. But in this case we lose the randomness of the
example set that is at the basis of the present learning framework.
Remaining in this framework, we may assume that we are allowed to ask relevant
information about the ongoing hypothesis from an information source outside
our computing system, which we therefore call an Oracle. That is the case of a
famous example where we learn a monotone Disjunctive Normal Form as follows
[Valiant, 1984].


Example 3.10. Using the same notation as in Example 3.5, we denote a formula
as monotone if no negated literal appears inside it. The list augmenting
algorithm proposed by Valiant to learn monotone k-term DNFs is the one outlined in
Algorithm 3.5.

Note that the formula we obtain at the end of the run is the maximal DNF h 
consistent with the examples, in the sense that no DNF still consistent with the 
examples exists whose terms strictly include the h’s. If we assume that the query 
to the Oracle lasts a constant time, then the whole algorithm is polynomial if 
applied to a polynomial number of examples. 
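The steps of Algorithm 3.5 can be sketched as follows. This is an illustration under explicit assumptions: monomials are represented as frozensets of variable indices, the examples fed to the procedure are positive ones, and the Oracle is simulated by a closure over a hidden target (`make_oracle` is a hypothetical helper, not part of the original algorithm).

```python
def satisfies(dnf, x):
    """True if the assignment x satisfies at least one monomial of the DNF."""
    return any(all(x[i] for i in term) for term in dnf)

def make_oracle(target):
    """Simulated Oracle: is a monotone monomial still included in the target DNF?"""
    def included(monomial):
        # a monotone monomial implies a monotone DNF iff it contains one of its terms
        return any(term <= monomial for term in target)
    return included

def list_augment(positive_examples, oracle):
    h = []                                        # start with the empty formula
    for x in positive_examples:
        if satisfies(h, x):
            continue                              # x already covered by h
        d = frozenset(i for i, bit in enumerate(x) if bit == 1)
        for i in sorted(d):                       # try to erase each literal in turn
            reduced = d - {i}
            if oracle(reduced):
                d = reduced
        h.append(d)                               # add the reduced monomial
    return h

target = [frozenset({0, 1}), frozenset({2})]      # v0 v1 + v2, hidden from the learner
examples = [(1, 1, 0, 1), (0, 0, 1, 1), (1, 1, 1, 0)]
h = list_augment(examples, make_oracle(target))
```

Each example costs at most n Oracle queries, which is where the polynomial running time under constant-time queries comes from.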


Not all batch mode algorithms can be converted into on-line releases

Confidence regions as usual

Upper bounds depend on the sum complexity + number of errors




Of course, an algorithm computing consistent hypotheses in one of the previ- 
ous on-line protocols can do it also in batch form (i.e. getting the whole examples 
set available at the beginning of the learning procedure), while the vice versa is 
not guaranteed in general. Many theorems exist however for identifying classes 
of concepts for which also the second implication is true [Ben-David et al., 1997]. 
In the last chapter of this book we will study some general purpose on-line al- 
gorithms that are suitable when the concept class is unknown. 

Note how in this framework the consistency property of an estimator takes
on a new significance. If our on-line procedure generates consistent estimators it
means that we are converging to an increasingly accurate hypothesis,
thus ensuring the convergence in probability considered in Definition 2.14.


3.4 Controlling the error 


At this point we have the main tools available (algorithms computing hypotheses
and distribution laws of their differences with concepts) to fulfill the true learning
goal: avoiding giving wrong answers to future questions. As usual in statistics,
we will manage the forecasting of our answering capability with two formats:
confidence intervals and point estimators, having the number of examples used
to achieve this capability as companion operational parameter.


3.4.1 Confidence intervals for the learning error 


Inequalities (3.10) and (3.11) fix a lower and an upper bound for $F_{U_{c \div h}}$ in terms
of the c.d.f. of two Beta variables. We can use these distributions in the way
shown in Chapter 2 for inferring the value of $U_{c \div h}$. In particular the solution
of an inverse problem like in Sec. 2.2.1 gives rise to the following Lemma:


Lemma 3.3. For a space $\mathfrak{X}$, concept and hypotheses classes and learning
algorithm specified as in Theorem 3.2, a confidence interval of level δ for the
symmetric difference $U_{c \div h}$ is the interval $(l_i, l_s)$ such that $l_i$ is the δ/2 quantile
of the Beta distribution of parameters $1 + t_h$ and $m - t_h$, and $l_s$ is the analogous
1 − δ/2 quantile for parameters $\mu_h + t_h$ and $m - (\mu_h + t_h) + 1$.
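As a numerical companion to the Lemma, the sketch below computes the two quantiles with the standard library only. It relies on the identity between the c.d.f. of a Beta variable with integer parameters (k, m − k + 1) and the upper tail of a Binomial variable; the function names and the bisection tolerance are illustrative choices, and the plain summation is meant for moderate m only.

```python
from math import comb

def binom_upper_tail(m, k, y):
    """P(Bin(m, y) >= k), i.e. the c.d.f. at y of a Beta(k, m - k + 1) variable."""
    return sum(comb(m, i) * y**i * (1 - y)**(m - i) for i in range(k, m + 1))

def solve(m, k, target, tol=1e-10):
    """Bisection on [0, 1]: the upper tail is increasing in y."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_upper_tail(m, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(m, mu_h, t_h, delta):
    l_i = solve(m, t_h + 1, delta / 2)            # delta/2 quantile of Beta(1+t_h, m-t_h)
    l_s = solve(m, mu_h + t_h, 1 - delta / 2)     # 1-delta/2 quantile of Beta(mu_h+t_h, m-(mu_h+t_h)+1)
    return l_i, l_s

l_i, l_s = confidence_interval(m=100, mu_h=3, t_h=0, delta=0.1)
```

For $t_h = 0$ the lower extreme has the closed form $1 - (1 - \delta/2)^{1/m}$, which gives a quick sanity check of the bisection.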


To get companion graphs of those reported in Fig. 2.23, we take into account
in Fig. 3.11 a third coordinate given by the detail $\mu_h$, where mislabeled points
represent the statistic on which to pivot the specifications of $U_{c \div h}$. The explicit
analytical expression of the extremes $y_1$, $y_2$ of the two-sided confidence interval
at level δ for $U_{c \div h}$ comes from the equations:

$$\sum_{i=\mu_h+t_h}^{m} \binom{m}{i} y_2^i (1-y_2)^{m-i} = 1 - \frac{\delta}{2} \qquad (3.27)$$

$$\sum_{i=t_h+1}^{m} \binom{m}{i} y_1^i (1-y_1)^{m-i} = \frac{\delta}{2} \qquad (3.28)$$




Fig. 3.11: Comparison between two-sided 0.900 confidence intervals for error proba- 
bility. x-axis: number of misclassified points. y-axis: VC dimension and class detail. 
z-axis: confidence interval limits. Light surfaces: confidence intervals based on the 
VC dimension. Dark surfaces: confidence intervals based on the class detail. Sample 
size: (a) m = 100, (b) m = 1000, (c) m = 1000000. 


Similarly, we obtain a one-sided confidence interval $(0, y_2)$ from (3.21) through:

$$\sum_{i=\mu_h+t_h}^{m} \binom{m}{i} y_2^i (1-y_2)^{m-i} = 1 - \delta \qquad (3.29)$$

When m grows, as the numerical solutions of these equations become difficult to
handle, the Binomial distribution underlying (3.27-3.29) can be approximated
with a Gaussian law, following the Central Limit Theorem (see Theorem 1.4).
In this case the equations for the two-sided interval read

$$F_Z^{-1}\left(\frac{\delta}{2}\right)\sqrt{m y_2 (1-y_2)} = \mu_h + t_h - m y_2 \qquad (3.30)$$

$$F_Z^{-1}\left(1-\frac{\delta}{2}\right)\sqrt{m y_1 (1-y_1)} = t_h + 1 - m y_1 \qquad (3.31)$$

respectively, where $F_Z$ denotes the c.d.f. of a Normal variable Z.
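Squaring (3.30) and (3.31) turns each of them into a quadratic equation in y, so the Gaussian limits have a closed form. The sketch below assumes Z is a standard Normal variable and picks, for each equation, the root compatible with the sign of its right-hand side; `gaussian_limits` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_limits(m, mu_h, t_h, delta):
    z = NormalDist().inv_cdf(1 - delta / 2)      # z > 0, and inv_cdf(delta/2) = -z

    def roots(k):
        # z^2 m y (1 - y) = (k - m y)^2  rewritten as  a y^2 + b y + c = 0
        a = m * m + z * z * m
        b = -(2 * k * m + z * z * m)
        c = k * k
        disc = sqrt(b * b - 4 * a * c)
        return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

    y1 = roots(t_h + 1)[0]       # (3.31): right-hand side positive, so y1 < (t_h+1)/m
    y2 = roots(mu_h + t_h)[1]    # (3.30): right-hand side negative, so y2 > (mu_h+t_h)/m
    return y1, y2

y1, y2 = gaussian_limits(m=1000, mu_h=4, t_h=10, delta=0.1)
```

The closed form avoids the long Binomial sums, which is the point of the approximation when m is large.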

To compare these graphs with the usual results in the literature we also consider
the well known and commonly used confidence intervals stated by Vapnik and
Chervonenkis since the late '70s. They refer to the random variable $R_{h \div c}$, i.e. the
measure of the symmetric difference between a fixed c and an h varying with the
extracted sample (commonly denoted as actual risk). These results are based
on another complexity index of a concept class that is defined as follows:


Definition 3.8. [Blumer et al., 1989] Given a concept class C and a finite set
$Q \subseteq \mathfrak{X}$, let $\Pi_C(Q)$ denote the set of all subsets of Q that can be obtained by
intersecting Q with a concept in C, i.e. $\Pi_C(Q) = \{Q \cap c,\ c \in C\}$. If $\#\Pi_C(Q) = 2^{\#Q}$
then we say that Q is shattered by C. The Vapnik-Chervonenkis dimension
of C (or VC dimension of C for short), denoted $d_{VC}(C)$, is the largest integer d
such that at least one set of d items is shattered by C; if no such d exists, then
$d_{VC}(C)$ is assumed to be infinite.
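For small finite settings the definition can be checked by brute force. The sketch below is only illustrative: concepts are represented as Boolean functions on a finite domain, and the two toy classes (right bounded half lines and closed intervals on six grid points) are assumptions chosen for the example.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """Q is shattered if the traces of the concepts on Q are all its 2^#Q subsets."""
    traces = {frozenset(p for p in points if c(p)) for c in concepts}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest size of a shattered subset of a finite domain (exhaustive search)."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(q, concepts) for q in combinations(domain, size)):
            d = size
        else:
            break          # subsets of shattered sets are shattered, so we can stop
    return d

domain = list(range(6))
half_lines = [lambda x, a=a: x <= a for a in [i - 0.5 for i in range(7)]]
intervals = [lambda x, a=a, b=b: a <= x <= b for a in range(6) for b in range(6)]

d_half_lines = vc_dimension(domain, half_lines)
d_intervals = vc_dimension(domain, intervals)
```

The search is exponential in the domain size, so it only serves to make the definition concrete.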


If we consider fixed population and random samples, $R_{h \div c}$ in place of $U_{c \div h}$

A less constructive complexity index

tightly linked to $D_{C,C}$.




Fig. 3.12: Comparing detail and VC dimension for the class of oriented half lines. (a) 
One point binds the symmetric difference; dashed line: union of the elements in the 
class, thick lines: half lines, the difference between gray and black ones is sentinelled 
by the diamond. (b) a configuration of two points A and B not shattered by half lines 
like c. 


Although semantically different, the two complexity indices are strongly
related by the following:

Fact 3.5. For any concept class C, related VC dimension $d_{VC}(C)$ and detail $D_C$
it holds:
$$D_C \le d_{VC}(C) + 1 \qquad (3.32)$$


Proof. The inequality descends from the following theorem:

Theorem 3.3. For any concept $c \in C$ and sentry function S on C, the set S(c)
is shattered by $C \cup \{\emptyset, \mathfrak{X}\}$.
□


Fact 3.6. For any concept class C and related VC dimension $d_{VC}(C)$ and detail
$D_{C,C}$ it holds:
$$(d_{VC}(C) - 1)/176 \le D_{C,C} \le d_{VC}(C) + 1 \qquad (3.33)$$

Proof. The right inequality descends from Fact 3.5 and the following.

Fact 3.7. For any concept class C, and related VC dimension $d_{VC}(C)$
$$d_{VC}(C) = d_{VC}(C \div C) \qquad (3.34)$$

The left inequality will be derived later (see the following Remark 3.3). □


Example 3.11. It is easy to see that the class of symmetric differences between
oriented half lines in Example 3.2 is constituted by segments with a fixed extreme
determined by a given c and the other varying with a second concept c' spanning
C. Thus only one point is sufficient to control the motion of the latter extreme,
and $D_{C,C} = 1$ (see Fig. 3.12(a)). It is also easy to check that no pair of points of
the real line can be shattered by C. In the case of right bounded lines as in
Fig. 3.12(b), indeed we have no c including point A and not point B. Thus the
VC dimension of C is 1. By Fact 3.7 the same holds for $C \div C$, hence $d_{VC}(C \div C)$
is 1.


Fig. 3.13: Same comparison as in Fig. 3.11 for class complexity equal to 4. x-axis:
number of misclassified points. y-axis: confidence interval limits. Black lines:
$d_{VC}(C)$-based confidence intervals. Dark gray lines: $D_{C,C}$-based confidence intervals.
Light gray lines: $D_{C,C}$-based confidence intervals obtained using a Gaussian approximation.


Actually the two indices constitute dual ways of characterizing the complexity
of a class. In short, the detail counts the number of points able to discriminate
a concept within a class, while the VC dimension counts the number of concepts
necessary to discriminate subsets of points within a set. The latter index is at
the basis of the majority of standard PAC learning theory, and in particular of
the following theorem fixing broad bounds which come from the Chernoff inequality
(see footnote 17).

Theorem 3.4. Let C be a Boolean concept class of bounded VC dimension
$d_{VC}(C) = d$, and let $\nu(h)$ be the frequency of errors computed from the sample
for a hypothesis $h \in C$. Then for $m > d$ and simultaneously for all the
hypotheses in C, both the events

$$\nu(h) - 2\sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right) - \ln\frac{\delta}{9}}{m}} \le R_{h \div c} \le \nu(h) + 2\sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right) - \ln\frac{\delta}{9}}{m}} \qquad (3.35)$$

$$R_{h \div c} \le \nu(h) + 2\,\frac{d\left(\ln\frac{2m}{d}+1\right) - \ln\frac{\delta}{12}}{m}\left(1 + \sqrt{1 + \frac{\nu(h)\, m}{d\left(\ln\frac{2m}{d}+1\right) - \ln\frac{\delta}{12}}}\right) \qquad (3.36)$$

have probability 1 − δ.


Figure 3.11 compares the confidence intervals coming from the two approaches.
To this end we equalize $R_{h \div c}$ with $U_{c \div h}$ and artificially fill the
gap between the two indices by: 1) referring to both complexity indices


The exact knowledge of $F_{U_{c \div h}}$ pays more than Chernoff inequalities on $R_{h \div c}$

A coverage check 
through SVM 
instances 




and empirical error constant with concepts and hypotheses, and 2) assuming
$D_{C,C} = d_{VC}(C) = k$.

The figure shows how the confidence intervals of the above random variables
depend on the related complexity index and the number of misclassified points.
Following the previous remark, we compute the latter quantity in (3.35) as $m\nu$.

For samples of 100, 1000 and 1000000 elements respectively, the three graphs
show the limits of the 0.900 confidence intervals drawn using both $d_{VC}(C)$
(external surfaces) and $D_C$ (internal surfaces) bounds. Moreover, as the graphs
show a relatively weak dependence on the complexity indices, to appreciate the
differences even better in Fig. 3.13 we draw a section with k = 4 as a function of
the number of misclassified points. For drawing our confidence regions we used
both dark gray lines for plotting bounds from (3.27-3.28) and light gray lines
for those from (3.30-3.31). Note that these different bounds are distinguishable
only in the first figure. In fact, Fig. 3.11(a) is drawn using both sets of
equalities, while Figs. 3.11(b) and 3.11(c) come directly from equalities (3.30-3.31).
The figures show that detail-based confidence intervals are


• always more accurate than those based on $d_{VC}$; this benefit amounts to a narrowing
  of one order of magnitude at the smallest sample size, while it tends to disappear
  when the sample size increases;

• feasible, that is they are always contained in [0, 1].


Example 3.12. Specializing Lemma 3.3 to SVM, in Fig. 3.14 we check the
coverage of the above intervals through a huge set of pairs of statistics and
error probabilities $u_{c \div h}$ sampled from learning instances on points uniformly
distributed in the unit hypercube and variously separated by hyperplanes with
random coefficients. Namely, in Fig. 3.14(a) we considered different sample sizes
for the number of support vectors fixed to 3, and in Fig. 3.14(b) we conversely
kept the sample size fixed to 100 and considered different numbers of support
vectors as upper bound to $D_{(C,H)_h}$, reported in abscissa. The slight oversize
of the intervals on each abscissa is connected with Fact 3.4. Indeed from the
graph in Fig. 3.14(c) of the percentage of points trespassing the confidence inter- 
vals with the accuracy 7 of the learning algorithm (see (3.18’)) we see that the 
design parameter 6 = 0.1 is stably reached with the increase of this parameter, 
i.e. with the decrease of accuracy. Note that for 7 > 0 the algorithm works on a 
superset of the support vectors, thus we rely on the upper bound n (the dimen- 
sion of the sample space) to their cardinality. Finally, in Fig. 3.15 we draw the 
same confidence region in case the SVM tries to divide two regions non linearly 
separable. In this case we use a parabolic surface for dividing the hypercube 
points, hence to label the sample as well. This may induce some mislabeling by 
the hypothesis even on the sample points. According to (3.21), we afford this 
case just by adding the number of support vectors and the number of mislabeled 
sample points in the abscissa of the graph and using a parameter 7 = 0.2. We 
have the same confidence intervals in correspondence of the abscissas, and $u_{c \div h}$
values well contained in these intervals as well.




Fig. 3.14: Course of misclassification probability $U_{c \div h}$ with the parameters of the
learning problem: (a) probability vs. sample size, (b) probability vs. number of
sentinels, (c) rate of $u_{c \div h}$ measures outside confidence region with the approximation
parameter 7 of the SVM algorithm. 


3.4.2 Sample complexity 


As seen in Sec. 2.3.3, the inverse operational problem when we deal with
confidence intervals is the identification of the sample size sufficient to have a tight
enough confidence interval for the error probability $U_{c \div h}$, with a sufficient
confidence. If we link this value to a relevant dimension of the learning task we
obtain a function of this parameter that we call sample complexity.


3.4.2.1 Distribution-free complexity 


Identifying sample complexity without bias on the $\mathfrak{X}$ distribution law amounts
to reversing inequalities (3.21) on m. Since it is reasonable to look for a one
(right) sided interval with a given confidence 1 − δ, we fix to this value either
$I_\varepsilon(1 + t_h, m - t_h)$, to obtain a lower bound on the sample complexity, or
$I_\varepsilon(\mu_h + t_h, m - (\mu_h + t_h) + 1)$, getting an upper bound to this function that we call
sample complexity tout court. In such a way we obtain the following.


What size needs a labeled sample?

From the lower bound on $F_{U_{c \div h}}$ to an upper bound on sample size




Fig. 3.15: Same picture as in Fig. 3.14(b) for non linearly separable instances. Abscissa: 
sum of detail plus number of wrongly classified sample points. 


Corollary 3.3. For
• a space $\mathfrak{X}$ and whatever probability measure P;
• any concept classes H and C on $\mathfrak{X}$;
• any fairly strongly surjective function $\mathcal{A} : \{z_m\} \mapsto H$;
• a sample $z_m$;
• any pair $0 < \varepsilon, \delta < 1$;

if

• for any infinite suffix $z_M$ of $z_m$ a $c \in C$ exists computing the example
  labels of the whole sequence;
• $h = \mathcal{A}(z_m)$ has detail $D_{(C,H)_h} = \mu_h$ and number of misclassified points $t_h$
  of probability $p_h$;
• $m \ge \max\left\{\frac{2}{\varepsilon}\ln\frac{1}{\delta},\ \frac{5.5(\mu_h + t_h - 1)}{\varepsilon}\right\}$ ¹⁷

then

$$P(U_{c \div h} < \max\{\varepsilon, p_h\}) > 1 - \delta \qquad (3.44)$$

Because of the right inequality in Fact 3.5, the upper bound on sample complexity
is lower here than in analogous well-known results. Indeed the usual bound based
therein on VC dimension reads for $t_h = 0$:
$m \ge \max\{(4/\varepsilon)\ln(2/\delta),\ (8 d_{VC}/\varepsilon)\ln(13/\varepsilon)\}$ [Blumer et al., 1989]. Thus, since
$D_{C,C} \le d_{VC}(C) + 1$ after Fact 3.6, our upper bound gains a factor $\ln(1/\varepsilon)$ with
respect to the canonical bound when H = C.

The lower bound may prove weaker. Indeed, with respect to the usual lower
bound for $t_h = 0$: $\max\{\ln(1/\delta)/(-\ln(1-\varepsilon)),\ (d_{VC} - 1)/(32\varepsilon)\}$ for $0 < \varepsilon < 1/8$
and $0 < \delta < 1/100$, we can fix with our approach only the first term of the max
operator ¹⁸. The tighter lower bound depends on the fact that the results using
VC dimension consider expressly a family of worst case distributions in place of
all possible (better case included) distributions of the sample suffixes. It is the
mentioned framework of the test of hypotheses: fix the world at the worst case
distribution and check if a concept class can be learnt. On the contrary, the
argument through which we fix the $F_{U_{c \div h}}$ lower bound, hence the complexity
upper bound, matches the worst case conditions. These identify with the worst
case assumption on the sentry points cardinality for sentineling $c \div h$ (see Sec.
3.3.1 for weaker conditions). Thus putting the two (upper and lower) worst case
bounds in contrast, we obtain the following theorem:

Theorem 3.5. For any concept class $C \div H$ and any probability measure P on $\mathfrak{X}$,
in the worst case condition the ratio r between maximum and minimum numbers
of examples needed to learn C through consistent hypotheses with accuracy
parameters $0 < \varepsilon < 1/8$, $0 < \delta < 1/100$ and $t_h$ mistakes is bounded by a
constant.

Remark 3.3. This result solves in part the old question of whether or not we can
save sample size with a smart learning algorithm. The answer is NO in worst
case conditions, apart from a constant multiplicative gain. From the proof of
the theorem the left inequality of Fact 3.6 also emerges easily.

¹⁷ The algebra underlying this result passes through the Chernoff inequality [Chernoff, 1952]
saying that

$$\sum_{i=0}^{d} \binom{m}{i} \varepsilon^i (1-\varepsilon)^{m-i} \le \left(\frac{m\varepsilon}{d}\right)^{d} e^{-m\varepsilon + d} \quad \text{for } d \le m\varepsilon \qquad (3.37)$$

which translates our target in

$$\left(\frac{m\varepsilon}{\mu_h + t_h - 1}\right)^{\mu_h + t_h - 1} e^{-m\varepsilon + \mu_h + t_h - 1} \le \delta \qquad (3.38)$$

i.e.

$$m\varepsilon - (\mu_h + t_h - 1)\ln\left(\frac{e\, m\varepsilon}{\mu_h + t_h - 1}\right) \ge \ln\frac{1}{\delta} \qquad (3.39)$$

If we split the target in

$$\frac{m}{2} \ge \frac{\mu_h + t_h - 1}{\varepsilon}\ln\left(\frac{e\, m\varepsilon}{\mu_h + t_h - 1}\right) \qquad (3.40)$$

$$\frac{m}{2} \ge \frac{1}{\varepsilon}\ln\frac{1}{\delta} \qquad (3.41)$$

where the former is satisfied by

$$m \ge 5.5\,\frac{\mu_h + t_h - 1}{\varepsilon} \qquad (3.42)$$

and the second by the obvious solution, we obtain the sufficient condition for m

$$m \ge \max\left\{5.5\,\frac{\mu_h + t_h - 1}{\varepsilon},\ \frac{2}{\varepsilon}\ln\frac{1}{\delta}\right\} \qquad (3.43)$$

¹⁸ We get this result just by imposing

$$1 - (1 - \varepsilon)^m \ge 1 - \delta \qquad (3.45)$$


A cheaper upper 
bound than with 
VC approach 


A tighter lower 
bound on 
complexity comes 


from the VC 
approach 


Only a constant 
factor 
distinguishes the 
upper bound on 
sample 
complexity from 
its lower bound 


No way for 
skillful algorithms 


in the most 
difficult cases 


The operational 
consequences 


Contrasting a 
concept with the 


others of a same 
class 


Few comparison 
if we limit them 

to concepts 
distinct enough 


A way for 
highlighting only 
templates of the 

concepts in a 
class 


A recoverable class is a learnable class with a polynomial sample




Still in an operational perspective we can synthesize these sample complexity 
results through the following Theorem: 


Theorem 3.6. Under the same hypotheses as Theorem 3.2, for each $0 < \varepsilon, \delta < 1$,
in case $m \ge \max\{(2/\varepsilon)\ln(1/\delta),\ 5.5(\mu_h + t_h - 1)/\varepsilon\}$, if $\mathcal{A}$ is a fairly strongly
surjective function from $\{z_m\}$ to H, then $\mathcal{A}$ is a learning algorithm with accuracy
parameters $\max\{p_h, \varepsilon\}$ and δ for C. Conversely, if $m < \ln(1/\delta)/(-\ln(1 - \varepsilon))$,
no $\mathcal{A}$ can be a learning algorithm with accuracy parameters ε and δ.
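To get a feeling for the orders of magnitude, the sketch below evaluates the two thresholds of the Theorem together with the VC-based bound quoted earlier in this section. The function names are illustrative, and the VC expression is coded with natural logarithms exactly as it appears in the text, which is an assumption about the intended base.

```python
from math import ceil, log

def detail_upper_bound(eps, delta, mu_h, t_h=0):
    """Sample size sufficient according to the detail-based bound."""
    return ceil(max(2 / eps * log(1 / delta), 5.5 * (mu_h + t_h - 1) / eps))

def lower_bound(eps, delta):
    """Sample size below which no algorithm reaches accuracy (eps, delta)."""
    return ceil(log(1 / delta) / -log(1 - eps))

def vc_upper_bound(eps, delta, d_vc):
    """VC-based sufficient size for t_h = 0, in the form quoted in the text."""
    return ceil(max(4 / eps * log(2 / delta), 8 * d_vc / eps * log(13 / eps)))

eps, delta = 0.0625, 0.05
m_detail = detail_upper_bound(eps, delta, mu_h=4)   # detail 4 ...
m_vc = vc_upper_bound(eps, delta, d_vc=3)           # ... compatible with d_VC = 3
m_low = lower_bound(eps, delta)
```

With these illustrative values the detail-based size is several times smaller than the VC-based one, reflecting the missing $\ln(1/\varepsilon)$ factor.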


3.4.2.2 Distribution dependent complexity 


More favorable results on sample complexity exist in case some information is
available on the $\mathfrak{X}$ probability distribution. As mentioned before, in this case
we can connect the symmetries of the distribution with the shape of
the concepts. We generally do it directly within the Kolmogorov framework in
a sort of extended Bernoulli experiment. The general problem is: how many
experiments must we do in order to discover and discriminate all relevant parts
of a concept in contrast with the others of the same class? In this framework a
key role is played by the notion of ε-cover as follows.


Definition 3.9. Let C be a concept class on $\mathfrak{X}$ and P be a probability measure
on $\mathfrak{X}$. $C_\varepsilon \subseteq C$ is an ε-cover of C w.r.t. P if for every $c \in C$ there is $c' \in C_\varepsilon$ such
that c' is ε-close to c, i.e. $P(c \div c') < \varepsilon$. We denote by $N(C, \varepsilon, P)$ the cardinality
of the smallest ε-covers of C w.r.t. P. The class C is finitely recoverable if
$N(C, \varepsilon, P) < +\infty$ for each ε > 0.


Example 3.13. Consider the class C of b_squares on $\mathfrak{X} = [0, 1]^2$ (an obvious
extension of the b_rectangles introduced in Sec. 3.1.2). If we consider the set
of b_squares whose edges have length $\sqrt{i\varepsilon}$, for $i = 1, \ldots, \lfloor 1/\varepsilon \rfloor$ ¹⁹, it is easy to
see that they represent an ε-cover of C, as illustrated in Fig. 3.16. Indeed, it is
easy to see that the difference between two consecutive b_squares in the figure is
exactly ε.
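The construction can be checked numerically. The sketch below rests on two assumptions not stated in this passage: b_squares are taken as squares $[0, s]^2$ anchored at the origin, and P is the uniform distribution on the unit square, so that the measure of the symmetric difference of two such squares is the difference of their areas. The function names are illustrative.

```python
from math import floor, sqrt

def square_cover_edges(eps):
    """Edge lengths sqrt(i * eps), i = 1, ..., floor(1 / eps)."""
    return [sqrt(i * eps) for i in range(1, floor(1 / eps) + 1)]

def closest_in_cover(edge, cover):
    """Cover element minimizing the area of the symmetric difference."""
    return min(cover, key=lambda s: abs(s * s - edge * edge))

eps = 0.1
cover = square_cover_edges(eps)

# area gaps between consecutive squares of the cover
gaps = [b * b - a * a for a, b in zip(cover, cover[1:])]

# worst distance of a grid of target squares from the cover
worst = max(abs(closest_in_cover(k / 100, cover) ** 2 - (k / 100) ** 2)
            for k in range(1, 101))
```

Since consecutive areas differ by ε, every target area in (0, 1] lies within ε of some cover element, which is what `worst` measures on a grid.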


A general theorem states that 


Theorem 3.7. [Benedek and Itai, 1988] Given a set $\mathfrak{X}$ and a probability measure
P on it, every concept class C finitely recoverable is learnable with accuracy ε
and δ from a set of m labeled examples with

$$m \ge \frac{32}{\varepsilon}\ln\frac{N(C, \varepsilon/2, P)}{\delta} \qquad (3.46)$$

with an algorithm which selects a hypothesis h from $C_{\varepsilon/2}$ with the minimum
number of labeling mistakes on the set of examples. Any algorithm learns C
with at least

$$m = (1 - \delta)\ln N(C, \varepsilon, P) \qquad (3.47)$$


¹⁹ where $\lfloor a \rfloor$ denotes the floor of a, i.e. the greatest integer less than or equal to a.




Fig. 3.16: An ε-cover $C_\varepsilon$ for the class C of b_squares in $[0, 1]^2$.


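Proof. (first part only) Let $b_0$ be an element of $C_{\varepsilon/2}$ which is ε/2-close to a
target $c \in C$, and $\{b_1, \ldots, b_n\}$ all the ε-far elements of $C_{\varepsilon/2}$ (in number
$n < N = N(C, \varepsilon/2, P)$). We want to sample $\mathfrak{X}$ so intensively that $b_0 \div c$ will contain
fewer points than any symmetric difference $b_i \div c$, at least with probability 1 − δ.
More precisely, we look for the joint events A: "$b_0 \div c$ contains fewer than $\frac{3}{4}\varepsilon m$
points", and $B_i$: "$b_i \div c$ contains more than $\frac{3}{4}\varepsilon m$ points" $\forall i \in \{1, 2, \ldots, n\}$.
The probability $p^*$ of this event is

$$p^* = P\left(A \cap \left(\bigcap_{i=1}^{n} B_i\right)\right) = P\left(A - \bigcup_{i=1}^{n} \overline{B_i}\right) \ge P(A) - \sum_{i=1}^{n} (1 - P(B_i)) \qquad (3.48)$$

and, denoting $B_{\min} = \arg\min_{B_i} P(B_i)$, since $n \le N - 1$, we have

$$p^* \ge P(A) - n(1 - P(B_{\min})) \ge P(A) - (N - 1)(1 - P(B_{\min})) \qquad (3.49)$$

with

$$P(A) \ge \sum_{i=0}^{\lfloor \frac{3}{4}\varepsilon m \rfloor} \binom{m}{i} \left(\frac{\varepsilon}{2}\right)^i \left(1 - \frac{\varepsilon}{2}\right)^{m-i} \ge 1 - e^{-\frac{\varepsilon m}{24}} \qquad (3.50)$$

$$P(B_{\min}) \ge \sum_{i=\lfloor \frac{3}{4}\varepsilon m + 1 \rfloor}^{m} \binom{m}{i} \varepsilon^i (1 - \varepsilon)^{m-i} \ge 1 - e^{-\frac{\varepsilon m}{32}} \qquad (3.51)$$

The sum of these terms is greater than 1 − δ for

$$m \ge \frac{32}{\varepsilon}\ln\frac{N}{\delta} \qquad (3.52)$$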

Another simple algorithm allows the following statement. 


Theorem 3.8. [Benedek and Itai, 1991] Given a set $\mathfrak{X}$ and a discrete probability
measure P on it, for every concept class C there exists an algorithm that learns
it with any accuracy ε and δ.

A class on a discrete space is a learnable class


The class of 
monotone 
monomials with 
uniform input is 
learnable also 
with a 
moderately 
wrong labeled 
sample 


An inequality for 
the unbiased 
estimator 




The size of an ε-cover is also the key for facing sample complexity in the presence
of disingenuous errors, when their amount is assessed in terms of rate (thus
of distribution law) over the sample size rather than of maximum number. A typical
statement in this framework is the following:


Theorem 3.9. /Apolloni and Gentile, 1998] For ¥ = {0,1}", P the uniform 
distribution on it, let C be the class of the monotone monomials on X¥, and n 
the probability of having a wrongly labeled example. Denoting | = In(1/6e), for 
n < 1/2, < 1/8 and any 6 < 1 and (n/l) large enough any learning algorithm 
a needs a sample size 

In (n/L) 


m> X on) (3.53) 


for a suitable constant x. 


3.4.3 Point estimators for the learning error 


Moving to point estimation, we gauge the expected value of $U_{c \div h}$ between two
estimators.

They represent the expected values of the two Beta distributions binding
the cumulative distribution function of $U_{c \div h}$, hence directly dependent on the
size of the training set.


Fact 3.8. With the same notations as in Theorem 3.2, the expected value of $U_{c \div h}$
is such that:

$$\frac{1 + t_h}{m + 1} \le E[U_{c \div h}] \le \frac{D_{(C,H)_h} + t_h}{m + 1} \qquad (3.54)$$
Example 3.14. For optimal margin hyperplanes, as in Corollary 3.2,

$$\frac{1 + t_h}{m + 1} \le E[U_{c \div h}] \le \frac{\mu_h + t_h}{m + 1} \qquad (3.55)$$

In case of separable samples we have $t_h = 0$. Moreover we find in the
literature another bound on the parameter in hand, obtained by the so called
Leave-One-Out technique [Vapnik, 1998]. Therefore, within the limits of merging
our framework with the Kolmogorov one, we might state the upper bound

$$E[U_{c \div h}] \le \min\left\{\frac{\mu_h}{m + 1},\ \frac{1}{m + 1}\frac{D_m^2}{\rho_m^2}\right\} \qquad (3.56)$$

where $\rho_m$ and $D_m$ are respectively the margin of the optimal separating
hyperplane and the maximum norm over the minimal sets of support vectors.


This concerns any sample and therefore we can use the upper bound on $E[U_{c \div h}]$
for predicting from above the cumulated loss in a long history if a linear cost
on $c \div h$ penalizes our mistakes. The sole additional point is that we must have
a mean value for the ratio $D_m^2/\rho_m^2$ w.r.t. this history. Thus




Fig. 3.17: Graph of the variance with μ + t of the Beta variable with c.d.f.
$I_\beta(\mu + t, m - (\mu + t) + 1)$, i.e. the upper bound from (3.58), for m = 20.


Fact 3.9. 


p2 

E*[Ucss] < ming 2, E' | 2e (3.57) 
m+1 Pa 

where E* denotes the double averaging operator and E' the simple averaging on
the sample history.


To control the reliability of our prediction, we can also consider the variance
of $U_{c \div h}$. Still comparing it with the variance of the gauging Beta distributions
we have:

Fact 3.10.

$$\frac{(1 + t_h)(m - t_h)}{(m + 1)^2 (m + 2)} \le V[U_{c \div h}] \le \frac{(\mu_h + t_h)(m - (\mu_h + t_h) + 1)}{(m + 1)^2 (m + 2)} \qquad (3.58)$$


Looking at Fig. 3.17 we trivially notice the parabolic behavior of the upper 
bound, which reveals the increase of this parameter as a second drawback when 
either the class complexity or the number of misclassified examples increases 
within the first half of m. 
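The bounds on mean and variance are just the moments of the two gauging Beta distributions, so they are immediate to evaluate. The sketch below uses the standard Beta moments; the function names are illustrative.

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) variable."""
    s = a + b
    return a / s, a * b / (s * s * (s + 1))

def error_moment_bounds(m, mu_h, t_h):
    """Lower/upper gauges for the mean and variance of the error measure."""
    lo = beta_mean_var(1 + t_h, m - t_h)
    hi = beta_mean_var(mu_h + t_h, m - (mu_h + t_h) + 1)
    return {"mean": (lo[0], hi[0]), "variance": (lo[1], hi[1])}

bounds = error_moment_bounds(m=20, mu_h=3, t_h=2)

# value of mu + t maximizing the upper variance bound for m = 20
peak = max(range(1, 21), key=lambda k: beta_mean_var(k, 20 - k + 1)[1])
```

For m = 20 the upper variance bound grows with μ + t up to about m/2 and decreases afterwards, which is the parabolic behavior visible in Fig. 3.17.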


3.5 Splitting the sentineling functionality: learning 
by pair of concepts 


Let us start by giving the dual definition of inner sentry function, which we will 
denote with s in the following, using sentinels lying inside a given concept c but 
outside the sentineled concepts (now included in c). 


Definition 3.10. Given a concept class C on a space $\mathfrak{X}$, an inner sentry function
on C is a total function $s : C \cup \{\emptyset, \mathfrak{X}\} \mapsto 2^{\mathfrak{X}}$ satisfying the following conditions
(see Fig. 3.18):

(i) Sentinels are inside the sentineled concept ($c \cap s(c) = s(c)$ for all $c \in C$).




Fig. 3.18: A schematic outlook of inner sentineling functionality. $c_0$: concept sentineled
against $c_1, c_2, c_3$. $\{x_1, x_2\}$: $c_0$ frontier.


(ii) Sentinels are outside the invading concept (Having introduced the sets 
cŒ =c—s(c), T= a°” and dw(c) = {cd € C such that c $ cd and (d) c 
©} 71, if c2 € dw(c1), then T2 N s(c1) # 0). 


(iii) s(c) is a minimal set with the above properties (No s' # s exists satisfying 
(i) and (ii) and having the property that s' (c) C s(c) for every c € C). 


(iv) Sentinels are honest guardians. It may be that d —s(c') C c but @Ns(c) = 0 
so that c’ ¢ dw(c). This however must be a consequence of the fact that 
all points of s(c) are involved in really sentineling c against other concepts 
in dw(c) and not merely devoted to avoiding inclusion of (c') by c. 
Thus, if we remove c’ then s(c) remains unchanged (Whenever cı and cə 
are such that c2 — s(c2) C cı and Ta N s(c1) = 0, then the restriction of s 
to cı Udw(c1) — {c2} is a sentry function on this set). 


s(c) is the inner frontier of c upon s, and its elements are called either sentry
points or sentinels. The inner detail $d_C$ of a concept class C is the supremum
of the cardinalities of its members' frontiers, computed with respect to all the
possible inner sentry functions:

$$d_C = \sup_{s, c} \# s(c) \qquad (3.59)$$

where s spans also over sentry functions on subsets of $\mathfrak{X}$ sentineling in this case
the intersections of the concepts with these subsets.


20for obvious typographical reasons. 
21 The counterpart of the invading concepts set in Definition 3.4. 




Fig. 3.19: Splitting the sentineling functionality. The circle and diamond denote 
respectively the inner and outer sentry points needed to sentinel c — h and h — c 
(circles here schematize more complex domains). 


Example 3.15. With the same notation as in Example 3.1, the class of monotone
monomials has inner detail 1, where each monomial $(v_{i_1} v_{i_2} \ldots v_{i_k})$ is sentineled by
the point x whose i-th component is 1 if $v_i$ is a factor in the monomial, 0
elsewhere.
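A brute-force check of this example on a few variables is sketched below. Monomials are represented as sets of variable indices (an illustrative encoding), and the concepts included in a monomial are taken to be the monomials with a strictly larger set of factors.

```python
from itertools import combinations

def sentinel(monomial, n):
    """Point with 1 exactly on the factors of the monomial."""
    return tuple(1 if i in monomial else 0 for i in range(n))

def satisfies(monomial, x):
    return all(x[i] == 1 for i in monomial)

n = 4
c = frozenset({0, 2})                      # the monomial v0 v2
x = sentinel(c, n)

# monotone monomials strictly included in c: those with extra factors
stricter = [frozenset(s)
            for r in range(len(c) + 1, n + 1)
            for s in combinations(range(n), r)
            if c < frozenset(s)]

inside = satisfies(c, x)                                    # sentinel lies inside c
outside_all = all(not satisfies(s, x) for s in stricter)    # and outside every included concept
```

A single point thus guards the monomial from the inside, in agreement with inner detail 1.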


From Examples 3.1 and 3.15 we realize that some concepts are easy to sentinel
from the outside, others from the inside. In addition, with reference to
Fig. 3.19 we see that sentineling a symmetric difference splits into two separate
actions: sentineling from outside of h the part h − c and from inside the part
c − h. For instance, consider the task of sentineling the symmetric difference
between two monotone clauses. In the case $c \subseteq h$ we must just bind the expansion
of h and we know this is done by a single sentry point. In the opposite case
$h \subseteq c$ we must now sentinel a clause from inside. This is a heavier job since we
need one point for every literal in the clause. Indeed:


Fact 3.11. Sentineling from the inside a monotone clause takes one sentry point 
for every literal in the clause. 


Thus the detail $D_{C,C}$ for the class of monotone clauses on n literals is n,
definitely greater than $D_C$ which equals 1, yet coinciding with $d_{VC}(C)$ still equal
to n. Similarly, we realize for the class of monotone monomials an inner detail
$d_C = 1$, outer detail $D_C = n$, $D_{C,C} = d_{VC}(C) = n$. In summary:


• There is no general relation between the detail of a hypothesis class H and the detail of its
symmetric difference with a concept class C, apart from inequalities like
the following.


Fact 3.12. For each H and C

$$D_H \le D_{H,\, C \cup \{\emptyset\}} \le d_{VC}(H) + 1 \qquad (3.60)$$


[Margin notes: Sentineling c ÷ h requires sentineling h partly from inside, partly from outside. Some concept classes are easier sentinelable from one side than from the other, hence a detail of symmetric difference higher than the detail of the original class. Other classes do not do so. A strategy to save points: two nested hypotheses, a much more favorable use of the sentry points, three labels: positive, negative, uncertain.]


186 Computational learning 


• Some concept classes have the same detail as the class of the symmetric
differences between their elements and concepts belonging to the class of
hypotheses. A first family of such classes is trivially the following:


Lemma 3.4. For each concept class C closed under sum and difference of
its concepts, $D_{C,C} = D_C$.


Example 3.16. The power set C of a real line and the class C ÷ C of the
symmetric differences of its elements both have detail equal to $+\infty$.


Another more useful and widespread family is the following.


Lemma 3.5. [Apolloni and Malchiodi, 2001] The class of convex polygons
and the class of circles in a plane have $D_{C,C} = D_C$.


Example 3.17. The class C of the convex polygons with at most k edges
has $D_{C,C} = D_C$ = [3k] and $d_{VC}(C) = 2k + 1$.


• Some concept classes have a smaller detail than that of their symmetric difference
with other classes. We have seen that this is true for the classes of monotone
monomials and clauses. The same happens for CNF and DNF. This is
why we move from outer sentinels of the clauses to remove some of them
from the current CNF hypothesis in the algorithm of Example 3.5, and
from inner sentinels to add monomials to the current DNF hypothesis in
Example 3.10.


We focus exactly on classes like the last ones mentioned, in search of a learning
strategy working only with the most favorable kind of sentry points. The general
idea is the following. For given classes C and H consider a pair of classes $H^i$, $H^o$
such that:


(a) $H \subseteq H^i$ and $H \subseteq H^o$
(b) $d_{H^i} \le D_{C,H}$; $D_{H^o} \le D_{C,H}$


Then use $H^i$ for stating a hypothesis that proves minimal²² on the basis of
the positive points, and $H^o$ for stating a maximal hypothesis based on negative
points. This entails in principle a further constraint, namely that a $c \in C$ is
respectively included in the latter and includes the former. By definition these
hypotheses always exist if $C \subseteq H$, possibly coinciding with c itself. In this way
we use at most $d_{H^i} + D_{H^o}$ points for sentineling a region where a hypothesis
consistent with the examples lies. Using a sample of m (mixed positive and
negative) points with $m > \max\{(2/\varepsilon)\ln(1/\delta),\ 5.5(\mu_h + t_h - 1)/\varepsilon\}$ we cannot say


22 In the sense that no other consistent hypothesis in $H^i$ is included in it.




Fig. 3.20: Classification using inner and outer borders: points inside $h^i$ and outside
$h^o$ are labeled as positive and negative, respectively. Points falling in the gap between
are said to belong to an uncertainty region.


that a hypothesis in the gap between $h^i$ and $h^o$ has an error $< 2\varepsilon$ with confidence
$> (1-\delta)^2$ ²³. Rather, we can have this error if we use $h^i$ for decreeing a positive
label to next points, $h^o$ for a negative label, and say nothing of points having
different labels from the two hypotheses (see Fig. 3.20). Things work still better
if we are able to draw minimal and maximal hypotheses such that the former is surely
included in c and the latter surely contains c. In this case the above sample size
m ensures an error $< 2\varepsilon$ with confidence $1 - \delta$ if we adopt the same procedure.
Indeed in this case $h - c \subseteq h^o - c$, $c - h \subseteq c - h^i$, and $h^o - h^i = (h^o - c) \cup (c - h^i)$.
In particular, dealing with Boolean formulas we may use $H^i$ = DNF and
$H^o$ = CNF²⁴. Any Boolean formula on a set of propositional variables can be
represented through either a DNF or a CNF on the same variables; thus condition
(a) is satisfied. To get the second condition we must on the one hand have a
very complex C, but on the other be allowed to focus on subsets of the normal
forms, for instance k-term CNF or k-term DNF.
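
The three-label rule described above (positive inside the inner hypothesis, negative outside the outer one, silent in between) can be sketched as follows. This is a minimal illustration: the function name `three_label` and the two interval hypotheses on the real line are hypothetical stand-ins for $h^i$ and $h^o$, chosen only because nestedness is easy to see.

```python
def three_label(x, h_inner, h_outer):
    """Label x with a nested pair of hypotheses (h_inner inside h_outer)."""
    if h_inner(x):          # inside the inner border: surely positive
        return "positive"
    if not h_outer(x):      # outside the outer border: surely negative
        return "negative"
    return "uncertain"      # the two hypotheses disagree: say nothing

# Hypothetical nested hypotheses: two intervals on the real line.
h_inner = lambda x: 2.0 <= x <= 3.0
h_outer = lambda x: 1.0 <= x <= 4.0

labels = [three_label(x, h_inner, h_outer) for x in (2.5, 0.5, 3.5)]
print(labels)
```

The point 3.5 falls in the gap between the two borders and therefore receives no decision.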


Example 3.18. [Apolloni et al., 2006b] 


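i) given $\mathfrak{X}_n$ and a set $\mathcal{E}^+$ of positive examples, a monotone monomial m with
arguments in $V_n$ is a canonical monomial if an $x \in \mathcal{E}^+$ exists such that

$$\text{for each } i \in \{1,\ldots,n\},\quad v_i \begin{cases} \in \mathrm{set}(m) & \text{if } x_i = 1 \\ \notin \mathrm{set}(m) & \text{otherwise} \end{cases}$$

ii) given $\mathfrak{X}_n$ and a set $\mathcal{E}^-$ of negative examples, a monotone clause c with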

23 actually the errors on the two hypotheses are highly statistically dependent. 
24See Example 3.18. 


[Margin notes: A still lower sample complexity if c surely falls in the gap between the hypotheses. The building blocks for building inner and outer borders of the wanted concept. Apply distributive property on set conjunction/disjunction to jump a level. Obtain a possibly never ending abstraction hierarchy. We learn if we have time to do it.]




arguments in $V_n$ is a canonical clause if an $x \in \mathcal{E}^-$ exists such that

$$\text{for each } i \in \{1,\ldots,n\},\quad v_i \begin{cases} \in \mathrm{set}(c) & \text{if } x_i = 0 \\ \notin \mathrm{set}(c) & \text{otherwise} \end{cases}$$

The union of canonical monomials constitutes an inner border (the union of the
thin contoured gray circles in Fig. 3.21). It represents a minimal hypothesis on
the target formula g*. Similar properties hold for the canonical clauses, whose
intersection now represents the maximal hypothesis consistent with g*, and then
an outer border.
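
A small sketch may help to see the two borders at work. It assumes the reading of Example 3.18 in which the canonical monomial of a positive example collects the variables set to 1, and the canonical clause of a negative example collects the variables set to 0. The target formula, the sample and all function names below are hypothetical, chosen only to check on three variables that the inner border lies inside a monotone target and the outer border contains it.

```python
from itertools import product

def canonical_monomial(x):
    """Variables set to 1 in a positive example."""
    return frozenset(i for i, bit in enumerate(x) if bit == 1)

def canonical_clause(x):
    """Variables set to 0 in a negative example."""
    return frozenset(i for i, bit in enumerate(x) if bit == 0)

def inner_border(positives):
    """Union (DNF) of the canonical monomials."""
    monomials = {canonical_monomial(x) for x in positives}
    return lambda x: any(all(x[i] for i in m) for m in monomials)

def outer_border(negatives):
    """Intersection (CNF) of the canonical clauses."""
    clauses = {canonical_clause(x) for x in negatives}
    return lambda x: all(any(x[i] for i in c) for c in clauses)

# Hypothetical monotone target g* = v0 v1 + v2 and a labeled sample of it.
target = lambda x: bool((x[0] and x[1]) or x[2])
positives = [(1, 1, 1), (0, 1, 1)]
negatives = [(1, 0, 0), (0, 1, 0)]

h_i, h_o = inner_border(positives), outer_border(negatives)
# Check inner border <= target <= outer border on the whole space.
nested = all((not h_i(x) or target(x)) and (not target(x) or h_o(x))
             for x in product((0, 1), repeat=3))
print(nested, h_i((0, 0, 1)), h_o((0, 0, 1)))
```

The assignment (0, 0, 1) satisfies the target and the outer border but not the inner one, so it lies in the gap between the two borders.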
Applying the distributive property to the union of two monomials:

$$v_i v_j v_k \vee v_l v_m v_n v_r = (v_i \vee v_l v_m) \wedge (v_i \vee v_n v_r) \wedge (v_j v_k \vee v_l v_m) \wedge (v_j v_k \vee v_n v_r) \qquad (3.61)$$
we obtain a new monomial where literals are constituted by clauses and, in
turn, literals in the clauses are substituted by monomials. We generalize the
operation, keeping groups of at most $k_2$ monomials and splitting them in such
a way that at most $k_1$ hyperclauses arise. Bounds on the k's stand for requisites of
conciseness on the formula description, i.e. for a compression of our knowledge
that constitutes the true inductive bias of our procedure. We do the same with
hyperclauses. Iterating the procedure, and denoting U! = A if t is odd and Æ 0, 
V otherwise, for t > 0, at the L-th abstraction level we obtain formulas belong- 
ing to the families of hyper_Lmonomials Gn;ko,kı,...,k„ -1 and hyper_L_clauses 
Gn:ko,ki,...,k-, Whose elements g can be written respectively as follows, with an 
obvious shift of the k indices, for v = 2L, ki < k; for each i < v, ki,, € N and 
suitable q 


k! 1 k! 2 k! gr 
1 2 v+ ; 
ga U= U; Urt, Valjo,jı, ju) for the monomials, and 
klai 


s Op T 

Ui 4 peal), 4 Ca ae S, for the clauses 

(3.62) 
In short, we pass from one level to the next adding a pair of operations of the
"∧∨" kind for metamonomials and "∨∧" for metaclauses. Figure 3.21 sketches
these formulas at the first two levels, delimiting a gap where the contours of 
suitable consistent hypotheses are found. This gap may be variously reduced 
on the basis of optimization criteria possibly under different consistency criteria 
[Apolloni et al., 2006b]. 
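
The level jump rests on the distributive identity (3.61). Reading it as $v_i v_j v_k \vee v_l v_m v_n v_r = (v_i \vee v_l v_m) \wedge (v_i \vee v_n v_r) \wedge (v_j v_k \vee v_l v_m) \wedge (v_j v_k \vee v_n v_r)$, which is an assumption about the extracted formula, a truth table over the seven variables confirms it. The function names are hypothetical.

```python
from itertools import product

def lhs(vi, vj, vk, vl, vm, vn, vr):
    """Union of the two monomials."""
    return (vi and vj and vk) or (vl and vm and vn and vr)

def rhs(vi, vj, vk, vl, vm, vn, vr):
    """Conjunction of four clauses whose literals are monomials."""
    return ((vi or (vl and vm)) and (vi or (vn and vr))
            and ((vj and vk) or (vl and vm))
            and ((vj and vk) or (vn and vr)))

# Compare the two sides on all 2**7 assignments.
identity_holds = all(bool(lhs(*x)) == bool(rhs(*x))
                     for x in product((False, True), repeat=7))
print(identity_holds)
```

The split of each monomial into two groups (here $v_i$ and $v_j v_k$, $v_l v_m$ and $v_n v_r$) is what bounds the number of hyperclauses produced.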


3.6 Learnability 


Learning is an aptitude sought for the computer ever since the early conception of
this machinery by researchers such as Von Neumann and Turing.




Fig. 3.21: Inner and outer borders, in the sample space, of a concept at two abstrac- 
tion levels. Inner borders are delimited by the union of formulas bounded by positive 
examples (gray circles with thin contour at ground level), outer borders by the inter- 
section of formulas bounded by negative examples (white circles with thin contour at 
ground level). Bold lines describe higher level formulas. 


Beyond classical results aimed at proving the learnability of wide classes of functions
per se, without considering the related computing resources [Gold, 1967],
the benefit of modern approaches comes from the pragmatic consideration
that we want to use what we have learned. This means that, as usual, a result
that could be achieved only after an unbearable computing time or with the
employment of enormous hardware resources has to be considered just a specification
of a random variable, thus difficult to use as a single value. Therefore,
with the usual dichotomy "polynomial is feasible, non polynomial unfeasible",
we define as learnable a class of concepts requiring feasible resources to be learned.
Namely:


Definition 3.11. A concept class C on a set $\mathfrak{X}$ characterized by a complexity
index n is learnable if there exists an algorithm $\mathcal{A}$ which, on labeled samples $z_m$,

1. for each $0 < \varepsilon, \delta < 1$,

2. if for each suffix of $z_m$ there exists a $c \in C$ computing the labels of its
elements,

computes hypotheses h on $\mathfrak{X}$ with measure $U_{c \div h} < \varepsilon$ in a time polynomial in
$n, 1/\varepsilon, 1/\delta$.


Remark 3.4. Examples of complexity indices are the maximum number of propositional
variables involved by the elements of a class of Boolean formulas, the


[Margin notes: Really learnable concepts. In the distribution-free framework the number of necessary examples and the time to process them are nearly independent, apart from accuracy effects. For special distribution laws some synergy may occur between the statistical and the computational step.]




detail of a class, the dimensionality of the continuous space where the elements
of a class of convex polygons are defined, the VC dimension, etc. Moreover, for
greater strictness, we should consider bounds both on the time and the memory
units, say bytes, used by $\mathcal{A}$. But the latter bound is implied by the former in
the case of sequential algorithms, the kind of algorithm commonly implemented
in our computers.


Looking at the above distribution-free statements we may split the learnability
of a concept class into two essentially separate requirements: a statistical
one concerning polynomiality of the sample complexity in the above parameters
$n, 1/\varepsilon, 1/\delta$, and a computational one concerning polynomiality, in the length of
the sample, of the algorithm for computing hypotheses on a sample. After Remark 3.3
indeed we realize that sample complexity is essentially independent
of the learning algorithm employed, apart from a constant factor, once we have
fixed the degree of approximation in terms of the number of mistakes allowed to
the hypotheses and the related probability masses. A reasonable index n on
which to base the class complexity is its description length, i.e. the number of
characters of a given alphabet we employ to describe the concepts. It matches
the number of propositional variables when the formulas to be learnt have a simple
structure such as monomials or clauses; otherwise more characters, yet generally
in a constant number, must be used to describe the structure as well. Note that
the detail of a class may be polynomial in this length, thus having a polynomial
sample complexity too; or not polynomial, as for CNF with no bound on the
number of variables for a clause (which makes sample complexity, and then the
whole learning task, unfeasible).

Once we have ascertained that the sample complexity is polynomial, feasibility
of the learning task just depends on the complexity of the CONSISTENCY
problem in relation to the concept class in hand. For instance we know that
learning k-term DNF requires a polynomial sample yet a non polynomial computing
time for drawing a hypothesis consistent with it. Vice versa, it is easy
to fix a consistent hypothesis, but an unlimited number of sentry points is needed,
when we aim to learn the class of the subsets of the real line. k-CNF is a class
requiring only k sentry points, at most, and a polynomial algorithm for computing
the hypothesis. With Fact 3.4 and Example 3.12 we learn in addition
that the accuracy with which we compute hypotheses may affect the number of
sentry points needed, and the sample complexity as a consequence.

When the points distribution law is known, some synergy may be established
between the statistical and computational parts of the problem, using probability
for discarding branches of the search algorithm that prove irrelevant, as
their implementation involves pieces of the hypothesis of negligible measure
with respect to the bulk that determines $U_{c \div h}$. With such a strategy we can discover for
instance a learning algorithm for k-term DNF that is polynomial in the number
n of propositional variables, when their assignments are uniformly distributed
within the n-long binary strings [Flammini et al., 1992]. This synergy extends




to a stretched version of the uniform distributions, such as q-bounded distribu- 
tions where the ratio between the probabilities of two assignments is less than a 
fixed q, and corrupted labeling scenarios where both disingenuous and malicious 
errors are allowed below a moderate threshold [Apolloni and Gentile, 2000]. 

We will explore this synergy in greater depth in the next chapters. In Chapter 4
we will consider statistics more complex than the mistake frequency and
less complex than a consistent hypothesis. But the shape of the statistic determines
a distribution law of their values that is more complex and varied than
the Beta distribution law that rules, almost alone, throughout this chapter. In Chapter
5 the distinction between statistical and computational aspects almost vanishes,
as for instance we know nothing about the statistic distribution law of a
feed-forward neural network.


3.7 Bibliographical notes and further readings 


Let us split the sampling mechanism of a random variable Y into a step gen- 
erating an intermediate random variable X that you can observe and a second 
one computing Y from X. As usual, an explaining function maps the uniform 
variable U into X. If the second step too requires random seeds (say U specifications)
then we usually describe its output in terms of the conditional distribution
law of Y given X specifications. Our framework is relatively simpler, since we
hypothesize that the bulk of the second step is deterministic, where the random 
seed may just intervene in terms of moderate noise, and the target is exactly 
discovering the function c computing the step, in spite of the randomness of 
the intermediate variable and possibly of the noise. In this chapter we focus 
on Y Boolean and fix the main commitment of working with consistent estimators
h of c. The latter translates into the requirement of managing consistent
hypotheses, in the sense of having $h(x_i) = c(x_i)$ for each sampled $x_i$, and raises
the CONSISTENCY computational problem, which may prove very hard to
solve. We just mentioned this problem and gave a short outline of the involved
computational aspects in Appendix D, but the reader can study the matter in greater depth in
many classical books such as [Garey and Johnson, 1978, Papadimitriou, 1994].
The statistical aspects turn around an extension of the twisting argument for the
parameter P of a Boolean variable, in order to take into account that P depends
on the sampled values of Y (or rather, of the joint variables (X, Y)) themselves.
Our perspective has been developed in papers of theoretical computer sci- 
ence [Apolloni and Chiaravalli, 1997, Apolloni and Malchiodi, 2001], cognitive 
sciences [Apolloni et al., 2002c, Apolloni and Kurfess, 2002], and artificial intelligence
[Apolloni et al., 2006b, Apolloni et al., 2004b, Apolloni et al., 2005a,
Apolloni et al., 2005b, Apolloni et al., 2005c, Apolloni et al., 2002a, Apolloni et al.,
2002b, Apolloni et al., 2003b, Apolloni et al., 2003a, Apolloni et al., 2002d,
Apolloni et al., 2001a]. The classical theory on computational learning instead is
rooted in a mostly computational framework. The idea of a machine having a
capability analogous to that of human beings in learning is as old as the projects for an
automatic computation. First modern results, in the province of partial recursive




functions paradigm, concern the learnability of wide classes of functions, such as 
the family of total recursive functions, in terms of an asymptotic identification 
of a function from an infinite sequence of its input-output pairs [Gold, 1967] 
or even a probability asymptotically converging to 1 of succeeding in this task 
[Blum and Blum, 1975]. The strong novelty introduced by Valiant's approach is
bartering some accuracy in the identification for the feasibility of the learning
task, i.e. for the possibility of bounding its sample and computational complexity.
His approach however mainly fits with a test of hypothesis framework where
he assumes the number K of examples falling into the symmetric difference $c \div h$
as the statistic to test the hypothesis $u_{c \div h} < \varepsilon$, thresholding the statistic to 0.
Then, since h is a specification of a random hypothesis H, he focuses on worst case
hypotheses to show that even with these hypotheses the probability of having
K = 0 when $U_{c \div h} > \varepsilon$ is less than $\delta$. In terms of classical test of hypothesis
theory [Wilks, 1962] this can be enunciated by saying that the second type risk of
the test is $\delta$ when the alternative hypothesis is $u_{c \div h} > \varepsilon$. The pioneering paper
by Valiant [Valiant, 1984] was matched in the eighties with a book by Vapnik
[Vapnik, 1982], discovered by a Californian team [Blumer et al., 1989]. This gave
a stronger statistical support to the Valiant argument in terms of a complexity
index, the well known Vapnik-Chervonenkis dimension, connecting the above
relation between $\varepsilon$ and $\delta$ to the complexity of the class to be learnt. This index
is used to count how many points we need to mark all the $\varepsilon$-covers of the current
hypothesis, i.e. to put a point in every symmetric difference of the hypothesis with
a concept of the concept class C to which the target concept c belongs. As we
search for distribution-free results this number depends on C and not on $\varepsilon$. The
great value of the Vapnik approach has been to clearly highlight that $U_{c \div H}$ is a
random variable, as H is the output of a learning algorithm having a
random sample as input. We stress that also $U_{c \div h}$, i.e. for a fixed h, is a random variable
per se. This is indeed the sole framework where we can consider, with the notation
of Theorem 3.4, the event $B = \left(\sup_{h \div c \in H \div C} |R(h \div c) - \nu(h \div c)| < \varepsilon\right)$ random.
It is the key event of the Vapnik theory, at the basis of classical sample
complexity results in its restriction to the class H of consistent hypotheses, thus
having $\nu(h \div c) = 0$. Under this constraint, the probabilistic features of B
come, in the classical probability framework, from the randomness of the set H
collecting hypotheses whose symmetric differences with c have no points of a
random sample inside. However, to get an approximation from above to the
probability of this event the authors enlarge H to the whole C, thus referring to
an event with no meaning in the classical statistical theories. On the contrary, in
our approach it makes sense to require


$$P(B \mid \nu(h \div c) = 0) = P\Big(\sup_{h \in H} R(h \div c) < \varepsilon\Big) = P(R(h^*) < \varepsilon) > 1 - \delta \qquad (3.63)$$

for some $h^* \in H$, since we assume the probability measure on the error domain
related to any h to be, in turn, a random variable.

With our approach we run through many of the usual extensions of the learning
task, for instance concerning the correctness of the example labels [Angluin
and Laird, 1988, Kearns and Li, 1988], on-line learning [Ben-David et al.,




1997, Auer and Long, 1999, Helmbold et al., 2000], the structural risk minimization
[Vapnik, 1998], i.e. bias-variance balance [Geman et al., 1992]. While we
work directly with the $U_{c \div h}$ distribution law, classical results concern directly
the sample complexity of the related learning task, usually coming from upper
bounds obtained considering worst case situations and using broad inequalities
like those from Chernoff [Chernoff, 1952] or Chentsov [Chentsov, 1963], with a
few exceptions when dealing with distribution dependent results [Bartlett and
Williamson, 1991]. Actually we do not have many results on the last topic coming
from our approach; thus the reader can find distribution dependent results
starting from [Benedek and Itai, 1988]. In the absence of topological information to
relate the concept parameters to the labels of the sampled points (as will happen
in the next chapter with regression theory), we should investigate the saving of sentry
points due to the shape of the points distribution law, as suggested
in Sec. 3.3.1.

A key index in our approach indeed is the detail $D_C$ of a class of concepts.
It allows for a more friendly appreciation of the complexity of a concept class.
We come back to the Vapnik-Chervonenkis dimension however to get the worst
case lower bound on sample complexity as found in [Blumer et al., 1989]. In this
way we reduce to a multiplicative constant the worst case gap between the
distribution-free lower and upper bounds on the sample size in case we learn by consistent
hypotheses [Apolloni and Chiaravalli, 1997].

Starting from the notion of sentry points, at the basis of our complexity
index, we went ahead in many topics of the computational learning theory. In
particular, discussing the Support Vector Machines, a tool for computing
hypotheses that has proved very efficient and popular in recent years [Schölkopf et al., 1999,
Cristianini and Shawe-Taylor, 2000], we pointed out a hidden relation between
the complexity of a concept class and the computational effort spent to learn
it, as regards the approximation with which hypotheses are computed. In the case of
SVMs we saw that the detail of this class is an inverse function of the accuracy
with which the hyperplanes are identified.




4 — Regression theory 


As we cannot code all decisions strictly in terms of 0 or 1, we may want to learn 
a function by easing it into a set of values — possibly approximated through 
a continuous variable. For instance, we prefer having a graduated system of 
unsolicited messages, in order to postpone reading the least interesting ones 
rather than eliminate them directly. A large part of our decisions (such as in
controlling a plant, measuring ingredients for a cake, quoting a stock, and so on)
are expressed in terms of multivalued variables that we may call actions instead 
of decisions. Learning a function outputting an action is harder than learning 
a Boolean function, of course. Thus we generally need: 


• more information about the examples distribution law. Few results can be
obtained in a distribution-free way on this matter (the most useful ones
are in fact not);


• classes of functions where we may discover the target function. The most
comfortable condition occurs with linear models, where we are allowed to
compute an approximation of the action whose difference w.r.t. the target
value is an easy random variable playing the role of additional noise. In
this case the target function may belong even to very complex classes.
We will however also provide cases where a more intricate relation exists
between the goal function and noise.


Regression theory is the umbrella for the various methodologies for learning these
functions, i.e. approximating a function whose analytical form is known up to
a finite number of parameters. On the basis of the inference tools enunciated
in Chapter 2, i.e. population bootstrap and twisting argument, we delineate
a very general learning procedure with the idea that also the functions of
a concept class may be managed as random elements to which a probability 
distribution may be associated. Hence, starting from a sample, we may rely 
on the typical products of inference, such as confidence intervals and point 
estimators. At the end of this chapter we will also provide a distribution-free 
general test for checking the suitability of the adopted concept class, i.e. of the 
employed regression model. 




[Margin notes: From labels to effects. Focused examples distribution law, i.e. a family of functions + noise. Need for computing functions' distribution laws. A function c to explain effects of future causes, hence a random element within a family of functions, hence a hypothesis h statistically close to c. Confidence region.]




4.1 The reference problem 


Extending the framework of Definition 3.3, we consider here a cause/effect sample

$$z_m = \{(x_i, y_i),\ x_i \in \mathfrak{X},\ y_i \in \mathfrak{Y},\ i = 1, \ldots, m\} \subseteq (\mathfrak{X} \times \mathfrak{Y})^m \qquad (4.1)$$

Again, we assume that a class C of functions exists such that for every
$M \in \mathbb{N}$ and every (cause/effect) population $z_M$ a c exists in C such that
$z_{m+M} = \{(x_i, c(x_i, \varepsilon_i)),\ i = 1, \ldots, m + M\}$. Note that in this case we assume
by default the intervention of the random variable $E_i$ (having specifications $\varepsilon_i$),
thus having a double source of randomness: the seed U of the sampling mechanism
of X, and E representing the relevant random variable of the problem.
Possibly independent of U, E is meant to render Y less rough and consequently
the learning problem less difficult. Namely, for E = 0 the problem is generally
over-determined and then impossible for a non exceedingly complex class of hypotheses
H (for instance, wanting to explain three non aligned points (x, y) on a
plane with a straight line). The random term introduces the degrees of freedom
missing (hence we fit the points to a straight line) in the move from the complex
concept class C to H. Sometimes, however, E represents a true source of randomness
on a C coinciding with H. In any case we look for a learning algorithm
$\mathcal{A}: \{z_m\} \to H$, so that $\mathcal{A}(z_m)$ is close enough to c with good probability, for
a suitable closeness measure. Usually we measure the accuracy simply as the
norm of the difference between computed and target y. This entails asking for
the consistency of the inferred function in the sense of Definition 3.2, possibly
with a fast increase of the accuracy with the sample size.

Comparing this with the scenario depicted in Fig. 2.4 we have simply an 
extension of the structural complexity of the observed data. This reflects on both 
a possibly richer set of parameters characterizing the sampling mechanism and 
a greater complexity of the explaining function, identifiable within the family 
C through the parameters. To be more precise, on the one hand we forget
the sampling mechanism generating X, hence its seed U as well, since we are
interested, at least at first, in inferring the whole curve c as a structure
underlying the entire $\mathfrak{X}$ ¹. On the other, we have a composition of sampling
mechanisms: a first one from a seed U’ to E (often taken for granted), and a 
second one with exactly E as default random seed and a c as explaining function. 
Of course there is no objection or technical difficulty to reverse the randomness 
of the parameters into a randomness of the explaining function that, as a random 
function, will be denoted by C. Thus we have a family of random functions. 
The problem consists in finding an ordering of this family that allows us to 
exploit its randomness. 

Let us first extend the notion of confidence interval to regions as follows.


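Definition 4.1. For sets $\mathfrak{X}$, $\mathfrak{Y}$ and a random function $C: \mathfrak{X} \to \mathfrak{Y}$, denote by abuse
$c \subseteq \mathfrak{D}$ the inclusion of the set $\{(x, c(x));\ \forall x \in \mathfrak{X}\}$ in $\mathfrak{D}$. A confidence region at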


1 Actually the interest in the randomness of X declines after the assumption of its inde- 
pendence from E. 




level $\gamma$ is a domain $\mathfrak{D} \subseteq \mathfrak{X} \times \mathfrak{Y}$ such that:

$$P(C \subseteq \mathfrak{D}) = 1 - \gamma \qquad (4.2)$$


Remark 4.1. Within the mentioned convention of disregarding X randomness,
this notion of confidence region is quite different from what is generally searched
for in regression theory, i.e. a collection of confidence intervals for a Y parameter,
for instance for E[Y], on each given x. Actually, in some cases we will fall back on
a weaker requirement: we require $c \subseteq \mathfrak{D}$ in the operational range of x.


Then let us imagine a family $\mathcal{D}$ of nested confidence regions with the obvious
relation:

$$\mathfrak{D} \subseteq \mathfrak{D}' \Leftrightarrow (1 - \gamma_{\mathfrak{D}}) \le (1 - \gamma_{\mathfrak{D}'}) \qquad \forall\, \mathfrak{D}, \mathfrak{D}' \in \mathcal{D} \qquad (4.3)$$

We are free to select $\mathcal{D}$ provided we are able to compute the probability measure
of elements of C inside it. Stated from an operational viewpoint: from among
possible $\mathcal{D}$ select a family meaningful in application terms and manageable as
for the probability computations. We will use a general procedure for identifying
a confidence region inside $\mathcal{D}$ that is made up of the following building blocks:


CD The curve distribution. The curve distribution derives directly from the 
distribution of its parameters. It may be analytically deduced in the cases 
where we may exploit a twisting argument, and possibly employed to still 
build confidence regions analytically. Otherwise it may be numerically 
computed through the population bootstrap. 


PV The pivot of the nested family $\mathcal{D}$, i.e. the central curve around which to
arrange all elements. It may depend, in principle, on the special operational
problem we are dealing with. A default selection may be constituted
by the maximum likelihood (ML) curve $c_{\hat\theta}$, i.e. the curve with parameter
$\Theta$ set to its MLE $\hat\theta$. In turn, we have seen in Fact 2.4 that the MLE of a
parameter represents the mode of its distribution law when: i) a sufficient
statistic is used in the twisting argument to discover this distribution, and
ii) the order relations on the statistic and the parameter coincide. Thus,
in the case of injective mapping from parameters to curves, we have that
under the above conditions $c_{\hat\theta}$ too represents the mode within the C family. If
we have no elements to compute the MLE of $\Theta$, we may look directly for
its modal value within its bootstrap distribution.


NR The nested regions. According to Definition 4.1 we are looking for a
domain completely containing (modulo the weakening in Remark 4.1) the
curves accounting for its probability measure. Thus measuring the nested
regions in $\mathcal{D}$ (see (4.3)) requires a metric for sorting the curves
by growing distance from the pivot. We may fix this metric analytically
as a function of $\Theta$, otherwise we may employ peeling methods [Rousseeuw
and Hubert, 1999] on the bootstrap population of curves.


[Margin notes: General procedure: a distribution of the curves, a pivot for the nested regions, nested regions.]




Fig. 4.1: Measuring the drops of a spring for different payload masses hung on it.


• (Fixing the metric) In the former case we must devote particular attention to the simple
connectivity² of the region $\mathfrak{D}$ identified by a given maximal distance
of its items from the pivot. Indeed, in order to compute the probability
of $\mathfrak{D}$ from the probability of $\Theta$ ranging in a suitable domain
$\Delta_\theta$, we need to guarantee the surjectivity of the mapping from $\Delta_\theta$
to $\mathfrak{D}$, in the sense that we must be sure that to any curve belonging
to $\mathfrak{D}$ there corresponds a $\theta \in \Delta_\theta$, so as to prevent underestimation of the
probability of $\mathfrak{D}$.


• (Peeling methods) In the latter we use an implicit metric. Namely, we start from a
bootstrap population of C and remove the $100\gamma$ percent of curves
that are not contained in the envelope of the remaining ones (we peel
the population indeed). As we will see later on, the implementation of
this method requires some sagacity. However, under weak symmetry
conditions, we may expect the domain to still be centered around the
MLE pivot.


The general procedure sequences the blocks as (CD → PV → NR) or (CD →
NR → PV) depending on the combination of analytical and numerical steps
making it up.


4.2 Linear regression 


E may represent the error measure

Take a one-kilo mass and hang it on end a of a spring whose other end b is
attached to the ceiling. You will see a drop of 2.03 centimeters. Then repeat
the experiment with masses of 2, 3 and 5 kilos. With respect to the original
(unweighted) position, you will get drops of 3.98, 5.87 and 10.1 cm
respectively, as in Fig. 4.1. You immediately deduce that the spring drops
2 centimeters per kilo (the minor differences, or errors, are due to a lot of
micro-phenomena affecting the measures). You refer to an ideal linear
elasticity phenomenon ruling


² A region is simply connected (also called 1-connected) if any simple closed curve can be
shrunk to a point continuously in the set [Courant and Robbins, 1966].


Linear regression 199 


the experiment, relying on it because of all the theoretical issues and experi- 
mental evidence, and want to discover the linearity coefficient. It may be a little 
harder to formulate a theoretical model for explaining a sharp relation between 
the length of an unsolicited message and how dull it is. But if it works experi- 
mentally (some minor differences aside), we are interested in approximating the 
degree of dullness with such a function, assuming as disregardable and random 
the part not computed by it. In our approach this is the true rationale behind 
the linear regression, which we formalize as follows: 


• the sample general form is

$$z_m = \{(x_i, y_i) : x_i \in \mathfrak{X},\ y_i \in \mathfrak{Y},\ y_i = c_\alpha(x_i) + \varepsilon_i,\ i = 1, \dots, m\} \subset (\mathfrak{X} \times \mathfrak{Y}) \qquad (4.4)$$

where the $\varepsilon_i$ values are specifications of a set of corresponding null-mean
independent random variables $E_i$, modeling an additive noise (or an error
term in a moralistic vision);


• the concept class is parametrizable in the form $C = \{c_\alpha,\ \alpha \in D_\alpha\}$, where
$\alpha$ indicates a parameter or a set of parameters;


• the random variables $X_i$ and $Y_i$ are linked by a functional relation.
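As a concrete instance of this formalization, the spring measurements quoted above can be fitted with a line. The sketch below is an illustration only: it assumes an ordinary least-squares fit written in the centered form $y = a + b(x - \bar{x})$, and the helper name `fit_centered_line` is hypothetical.

```python
masses = [1.0, 2.0, 3.0, 5.0]      # payloads in kilos (from the text)
drops = [2.03, 3.98, 5.87, 10.1]   # measured drops in centimeters (from the text)


def fit_centered_line(xs, ys):
    """Least-squares fit of y = a + b * (x - x_bar); returns (a, b, x_bar)."""
    m = len(xs)
    x_bar = sum(xs) / m
    a = sum(ys) / m                                   # intercept at x = x_bar
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = sum(y * (x - x_bar) for x, y in zip(xs, ys)) / s_xx
    return a, b, x_bar


a, b, x_bar = fit_centered_line(masses, drops)
# Residuals play the role of the epsilon_i in (4.4): what the line does not explain.
residuals = [y - (a + b * (x - x_bar)) for x, y in zip(masses, drops)]
print(f"slope = {b:.3f} cm/kg, residuals = {[round(r, 3) for r in residuals]}")
```

The fitted slope is close to the 2 centimeters per kilo deduced by eye, and the residuals are the small discrepancies attributed to micro-phenomena.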


Example 4.1. Linear regression problems are those of discovering lines,
polynomials or exponential functions fitting points in $\mathbb{R}^2$, under the model
$y_i = c_\alpha(x_i) + \varepsilon_i$, i.e. with


1. $c_\alpha(x) = \alpha_1 + \alpha_2 x$


2. $c_\alpha(x) = \alpha_1 + \alpha_2 x + \alpha_3 x^2 + \dots$

3. Calz) = aye” + aze” 

An instance of non linear regression problem is connected to a model like y; = 
pri 


vite” 


4.2.1 Learning a confidence region for a regression line 
The easiest and best known instance of regression problem is the following: 
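• $\mathfrak{X}$ and $\mathfrak{Y}$ are subsets of $\mathbb{R}$;


• the concept class is $C = \{c_{a',b}(x) = a' + bx,\ a', b \in \mathbb{R}\}$, where, given a
sample $\{x_1, \dots, x_m\}$, the parameter $a'$ can be expressed as $a' = a - b\bar{x}$,
with $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ and $a \in \mathbb{R}$, so that the generic element $\ell$ reads:

$$\ell = a + b(x - \bar{x}) \qquad (4.5)$$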


explaining drift 
from a linear 
relation 


Additive noise < 
linear regression 


Underlying linear 
relation < linear 
regression line 




together with a simple noise

• the random variables $E_i$ are assumed to be independent, and to have
distributions identical and symmetric around 0.


Remark 4.2. To facilitate the future computations, we adopted the special
representation of a line whose $\mathfrak{X}$ coordinate is centered around $\bar{x}$. This is without
loss of generality, since it corresponds to shifting the origin of the $\mathfrak{X}$ axis by $\bar{x}$
in the framework where $c_\alpha(x)$ is represented through $a + bx$.


Linear regression curves are extremely clean models. They allow us to state 
simple twisting arguments directly on the parameters of the entire curves, so 
that we may deduce their distribution laws either analytically, if we know the 
distribution laws of the involved statistics, or numerically, if we must bootstrap 
their populations. Therefore the three blocks of the outlined procedure read as 
follows. 


4.2.2 Curve Distribution 


The independence of E w.r.t. X (hence w.r.t. the underlying U) allows us to hide
the X sampling mechanism, which corresponds to assuming X specifications to
be always known a priori and to focusing on the intermediate model that holds
for every pair $(x_i, y_i)$

$$y_i = a + b(x_i - \bar{x}) + \varepsilon_i \qquad (4.6)$$


4.2.2.1 Plugging parameters’ distributions into curves 


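Who knows a and b?

Taking the explaining function of E for granted, we focus on the random
parameters A and B whose specifications a and b appear in the above equation.


• We may derive two separate implications from the following relations,
since $\sum_{i=1}^{m} (x_i - \bar{x}) = 0$:

$$\sum_{i=1}^{m} y_i = m a + \sum_{i=1}^{m} \varepsilon_i \qquad (4.7)$$

$$\sum_{i=1}^{m} y_i (x_i - \bar{x}) = b \sum_{i=1}^{m} (x_i - \bar{x})^2 + \sum_{i=1}^{m} \varepsilon_i (x_i - \bar{x}) \qquad (4.8)$$

Shaking them

It is evident that, denoting with $\breve{y}_i$ the value assumed by the observation
$y_i$ when unknown parameters shift from $a$ and $b$ to $\breve{a}$ and $\breve{b}$ respectively,
we can state

$$(a \le \breve{a}) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i \le \sum_{i=1}^{m} \breve{y}_i\Big) \qquad (4.9)$$

$$(b \le \breve{b}) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i (x_i - \bar{x}) \le \sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})\Big) \qquad (4.10)$$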




• To be sure that the above implications hold homogeneously for every sample
(avoiding drawbacks coming from missing points as in Example 2.8),
we must check that $\sum_{i=1}^{m} y_i$ and $\sum_{i=1}^{m} y_i(x_i - \bar{x})$ are sufficient statistics
w.r.t. a and b respectively. In this case we can assert that logical relations
(4.9) and (4.10) represent a twisting argument.
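The monotone link exploited by (4.9) and (4.10) can be checked numerically. The sketch below keeps the noise seeds fixed and moves one parameter at a time; the helper name `sample_statistics`, the sample size and the numeric values are illustrative assumptions.

```python
import random


def sample_statistics(a, b, xs, eps):
    """Statistics of (4.7) and (4.8) for the model y_i = a + b (x_i - x_bar) + eps_i."""
    x_bar = sum(xs) / len(xs)
    ys = [a + b * (x - x_bar) + e for x, e in zip(xs, eps)]
    s1 = sum(ys)                                        # twisted with a
    s2 = sum(y * (x - x_bar) for x, y in zip(xs, ys))   # twisted with b
    return s1, s2


rng = random.Random(0)
xs = [rng.uniform(1.0, 3.0) for _ in range(20)]
eps = [rng.gauss(0.0, 20.0) for _ in range(20)]   # noise seeds, kept fixed below

s1, s2 = sample_statistics(15.0, 15.0, xs, eps)
s1_a_up, s2_a_up = sample_statistics(16.0, 15.0, xs, eps)   # only a grows
s1_b_up, s2_b_up = sample_statistics(15.0, 16.0, xs, eps)   # only b grows
```

With the same seeds, raising a raises only the first statistic and raising b raises only the second, because $\sum_i (x_i - \bar{x}) = 0$ decouples the two relations.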


Hunting sufficient statistics

We may check this through the factorization criterion (2.34). In particular,
when E belongs to the exponential family of random variables having a
density of the form

$$f_E(\varepsilon; \theta) = \alpha(\theta)\,\beta(\varepsilon)\,\exp[\gamma(\theta)\,\delta(\varepsilon)] \qquad (4.11)$$

At least for E exponentially distributed independently of X

since $\varepsilon_i = y_i - a - b(x_i - \bar{x})$ by (4.6), denoting $k_i(\vartheta) = a + b(x_i - \bar{x})$ the
function of the parameter set $\vartheta = \{a, b\}$ (fixing $x_i$ and $\bar{x}$ as conditioning
variables) and writing $\varepsilon_i = y_i - k_i(\vartheta)$, we must consider the subclass of
densities where the sample likelihood

$$L((x_1, y_1), \dots, (x_m, y_m); \theta, \vartheta) = L(y_1, \dots, y_m; \theta, \vartheta, x_1, \dots, x_m)\, L(x_1, \dots, x_m)$$

$$= \alpha(\theta)^m \Big[\prod_{i=1}^{m} \beta(y_i - k_i(\vartheta))\Big] \exp\Big(\gamma(\theta) \sum_{i=1}^{m} \delta(y_i - k_i(\vartheta))\Big)\, L(x_1, \dots, x_m) \qquad (4.12)$$

can be factorized according to (2.58) in relation to both $\theta$ and $\vartheta$. This
may happen when

with a symmetric drift

(i) $\delta(\varepsilon) = \varepsilon^2$, so that

$$\exp\Big(\gamma(\theta) \sum_{i=1}^{m} \delta(y_i - k_i(\vartheta))\Big) = \exp\Big(\gamma(\theta) \sum_{i=1}^{m} \big(y_i^2 - 2 a y_i - 2 b y_i (x_i - \bar{x})\big)\Big) \exp\Big(\gamma(\theta) \sum_{i=1}^{m} k_i(\vartheta)^2\Big) \qquad (4.13)$$

and the second exponential function can be incorporated in $\alpha(\theta)^m$.
Obviously, the same considerations are applicable when $\delta$ depends
linearly on $\varepsilon^2$.³

and other simplicities

(ii) The factorization $\beta(y - k(\vartheta)) = \beta_1(y)\,\beta_2(k(\vartheta))$ holds. This is true
when either $\beta(\varepsilon) = 1$ or $\beta(\varepsilon) = e^{d\varepsilon}$ for known $d \in \mathbb{R}$, where the last
case reduces to the former by splitting $\varepsilon$ in the two components $y_i$
and $k_i(\vartheta)$ and adding $d y_i$ to the first exponent in (4.13) and $d k_i(\vartheta)$
to the second one.

³ The density classes for the two releases of $\delta$ coincide since we move the constant in $\alpha(\theta)$.

For fair exponential errors

To summarize, when the $\varepsilon_i$ are distributed according to a density of the
exponential subfamily whose general element has the form

$$f_E(\varepsilon; \theta) = \alpha(\theta) \exp(\gamma(\theta)\, \varepsilon^2) \qquad (4.14)$$

then

$$L(y_1, \dots, y_m; \theta, \vartheta, x_1, \dots, x_m) = h(\theta, \vartheta) \exp\Big(\gamma(\theta) \sum_{i=1}^{m} \big(y_i^2 - 2 a y_i - 2 b y_i (x_i - \bar{x})\big)\Big) \qquad (4.15)$$

and $L(x_1, \dots, x_m)$ does not depend on the boiling up parameters. Therefore

simple sufficient statistics

– $\sum_{i=1}^{m} y_i$ is a minimal sufficient statistic for a;

– $\sum_{i=1}^{m} y_i (x_i - \bar{x})$ is a minimal sufficient statistic for b;

– $\sum_{i=1}^{m} y_i$, $\sum_{i=1}^{m} y_i (x_i - \bar{x})$, and $\sum_{i=1}^{m} y_i^2$ are joint sufficient statistics for $\theta$.

Twisting A with E distribution law

Twisting B with E distribution law

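• Passing to the probabilities of the equivalent events for E within this family
of random variables

1. Since

$$\Big(\sum_{i=1}^{m} y_i \le \sum_{i=1}^{m} \breve{y}_i\Big) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i - m\breve{a} \le \sum_{i=1}^{m} \breve{y}_i - m\breve{a}\Big)$$

we may work with the random variable $S_E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} Y_i - m a$, for any a, and have

$$F_A(\breve{a}) = 1 - F_{S_E}\Big(\sum_{i=1}^{m} y_i - m\breve{a}\Big) \qquad (4.16)$$

2. Since

$$\Big(\sum_{i=1}^{m} y_i (x_i - \bar{x}) \le \sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})\Big) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i (x_i - \bar{x}) - \breve{b} \sum_{i=1}^{m} (x_i - \bar{x})^2 \le \sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x}) - \breve{b} \sum_{i=1}^{m} (x_i - \bar{x})^2\Big)$$

we may work with the random variable $S_E^x = \sum_{i=1}^{m} E_i (x_i - \bar{x}) = \sum_{i=1}^{m} Y_i (x_i - \bar{x}) - b \sum_{i=1}^{m} (x_i - \bar{x})^2$, for any b, and have

$$F_B(\breve{b}) = 1 - F_{S_E^x}\Big(\sum_{i=1}^{m} y_i (x_i - \bar{x}) - \breve{b} \sum_{i=1}^{m} (x_i - \bar{x})^2\Big) \qquad (4.17)$$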
i=1 i=1 


Since $\mathrm{Cov}[S_E, S_E^x] = 0$⁴ we have $\mathrm{Cov}[A, B] = 0$, which denotes A and B linearly
independent. This means a certain degree of independence that coincides with
full independence when the E distribution falls in a wide family whose most relevant
element is the Gaussian distribution. However, even in this case using A and
B separately for finding a confidence interval for the straight line (4.5) proves
inefficient. You may draw in the (a, b) plane a rectangle of edges $(A_{\delta/4}, A_{1-\delta/4})$ and
$(B_{\delta/4}, B_{1-\delta/4})$ as in Fig. 4.2 to have a region where the parameter $\Theta = (A, B)$
falls with probability $\approx 1 - \delta$ for $\delta$ small enough. However, the corresponding
region spanned in the (x, y) plane by the lines with parameters a, b in the above
rectangle as in Fig. 4.2 suffers from the drawback that the lines are not well
ordered: since the line d.f. is the product of the plugged parameters' d.f.s, you
have in the same region lines with extremely high d.f. and lines with extremely
low d.f. Moreover, if we abandon the rectangular shape, for instance pointing
at elliptical shapes, so that the border of the confidence region coincides with
a contour line of the joint c.d.f. of A and B, then we lose the connectivity of
the region, in the sense that you have discontinuous angular coefficients. To
overcome these drawbacks we move to a more suitable way of managing the
curves' probability distribution.
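The null covariance used at the start of this discussion can be verified in one line. The derivation below is a sketch under the additional assumption that the $E_i$ have a finite common variance $\sigma^2$; it uses only their independence, their null mean and the identity $\sum_i (x_i - \bar{x}) = 0$:

$$\mathrm{Cov}[S_E, S_E^x] = \sum_{i=1}^{m} \sum_{j=1}^{m} (x_j - \bar{x})\, \mathrm{E}[E_i E_j] = \sigma^2 \sum_{i=1}^{m} (x_i - \bar{x}) = 0$$

Reading (4.16) and (4.17) as the affine maps $A = (s_1 - S_E)/m$ and $B = (s_2 - S_E^x)/\sum_i (x_i - \bar{x})^2$ of the observed statistics $s_1 = \sum_i y_i$ and $s_2 = \sum_i y_i (x_i - \bar{x})$ (the names $s_1$ and $s_2$ are introduced here for illustration only), the covariance of A and B is a constant multiple of the one above, hence null as well.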


4.2.2.2 Pointing at the curve as a whole 


The procedure for finding the distribution of both the regression line and the
observed points, namely the distribution of $A + B(x - \bar{x})$ and $A + B(x - \bar{x}) + E$,
is a little more complex.


• In order to state a relation analogous to (4.9) and (4.10) involving the
entire regression line we need to introduce an order relation also in the
lines' family. We do this through a contour lines' family $D_k$ constituted
by the envelope of: i) a set of straight lines $y = a^* + b^*(x - \bar{x})$ such
that $b^* > 0$ and $a^* + b^*$ equals a thresholding parameter k, and ii) a set
of straight lines such that $b^* < 0$ and $a^* - b^* = k$. Each $D_k$ partitions
the family of all the lines in a plane in a set $I_k$ and its complement such
that: a) each line $\ell \in I_k$ lies completely under $D_k$, and b) for each $k' < k$
contour $D_{k'}$ lies completely under $D_k$.


Figure 4.4(a) later on gives a qualitative picture of $I_k$. We will consider it
as the intersection of two regions bounded either before $\bar{x}$ (call it $I_k^l$) or

⁴ Thanks to the sole hypotheses of independence and 0 mean of the $E_i$'s.


Plugging 
parameters 
generates a poor 


confidence region 


either including 
unessential 
curves 


or leaving holes 


managing curves 
may prove 


complex. 


Splitting the 
problem in two 
halves 




(a) (b) 


Fig. 4.2: A confidence region for the regression lines: (a) a two dimensional interval
in the (a, b) parameter space, and (b) the corresponding region of lines y = a + bx in
the (x, y) space.


with a condition
for a and b
separately


plus a condition
on their sum


gets a way for
involving
statistics


after $\bar{x}$ (call it $I_k^r$).

First of all we notice that the right part (after $\bar{x}$) of a line $y = a' + b'(x - \bar{x})$
lies completely over the line $y = a + b(x - \bar{x})$, i.e. dominates it, if and only
if both parameters $a'$ and $b'$ are greater than or equal to the corresponding
parameters a and b. Namely

$$(a \le a' \wedge b \le b') \Leftrightarrow (a + b(x - \bar{x}) \le a' + b'(x - \bar{x}),\ \forall x > \bar{x}) \qquad (4.18)$$

Moreover, we recognize that parameters a and b of a line $\ell$ have a sum less
than or equal to k if and only if two numbers $a'$ and $b'$ exist with a sum
less than or equal to k as well which are parameters of a line lying over $\ell$
for each $x > \bar{x}$. I.e.

$$\big(\exists\, a', b' \text{ s.t. } (a' + b' \le k) \wedge (a + b(x - \bar{x}) \le a' + b'(x - \bar{x}),\ \forall x > \bar{x})\big) \Leftrightarrow (a + b \le k) \qquad (4.19)$$


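Following (4.9) and (4.10) we may also state:

$$(a + b \le k) \Leftrightarrow \Big[\exists\, a', b' \text{ such that } (a' + b' \le k) \wedge (\breve{y}_i = a' + b'(x_i - \bar{x}) + \varepsilon_i,\ \forall i = 1, \dots, m)\ \wedge$$

$$\Big(\sum_{i=1}^{m} y_i + m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} \le \sum_{i=1}^{m} \breve{y}_i + m \frac{\sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}\Big)\Big] \qquad (4.20)$$

since the last brackets read $(ma + mb \le ma' + mb')$.


Joining (4.19) and (4.20) we have:


for each a, b such that $a + b \le k$ there exist $a', b'$ with $a' + b' \le k$ such that

$$(\breve{y}_i = a' + b'(x_i - \bar{x}) + \varepsilon_i,\ \forall i = 1, \dots, m) \wedge \Big(\sum_{i=1}^{m} y_i + m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} \le \sum_{i=1}^{m} \breve{y}_i + m \frac{\sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}\Big)$$

$$\Leftrightarrow \big[(a + b(x - \bar{x}) \le a' + b'(x - \bar{x}),\ \forall x > \bar{x})\big] \qquad (4.21)$$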
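For $k \in \mathbb{R}$ we can therefore consider the ceiling family

$$D_k^r = \{(a^r, b^r)\}_k = \{\arg\sup{}^r \{(a', b') \text{ s.t. } a' + b' \le k\}\} \qquad (4.22)$$

representing the top contour of $I_k^r$, where $\arg\sup^r$ means any pair $(a^r, b^r)$,
with $a^r + b^r = k$, of parameters of a line $\ell$ such that no line $y = a' + b'(x - \bar{x}) \in I_k^r$
lies even partially over $\ell$ when $x > \bar{x}$. Thus the membership of a
line $y = a' + b'(x - \bar{x})$ to $I_k^r$ is checked for a suitable element of $D_k^r$ through
the implication:

that reads in terms of a direct twist ... statistics

$$(a + b(x - \bar{x}) \le a^r + b^r(x - \bar{x})) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i + m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} \le \sum_{i=1}^{m} \breve{y}_i + m \frac{\sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}\Big) \qquad (4.23)$$

with $\breve{y}_i = a^r + b^r(x_i - \bar{x}) + \varepsilon_i$.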


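• Therefore if we introduce the random variable

$$S_E^\Sigma = S_E + m \frac{S_E^x}{\sum_{i=1}^{m} (x_i - \bar{x})^2} = \sum_{i=1}^{m} E_i \Big(1 + m \frac{x_i - \bar{x}}{\sum_{j=1}^{m} (x_j - \bar{x})^2}\Big)$$

we obtain the following distribution function

and a distribution law for straight lines.

$$F_{I^r}(I_k^r) = P(I^r \subseteq I_k^r) = 1 - F_{S_E^\Sigma}\Big(\sum_{i=1}^{m} y_i - m a^r + m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} - m b^r\Big) \qquad (4.24)$$

Analogously on the left

• Considering $I_k^l$, hence the complementary constraint $x < \bar{x}$, we have analogous implications

$$(a \le a' \wedge b \ge b') \Leftrightarrow (a + b(x - \bar{x}) \le a' + b'(x - \bar{x}),\ \forall x < \bar{x}) \qquad (4.25)$$

with the sole change of a sign

and then

$$(a + b(x - \bar{x}) \le a^l + b^l(x - \bar{x})) \Leftrightarrow \Big(\sum_{i=1}^{m} y_i - m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} \le \sum_{i=1}^{m} \breve{y}_i - m \frac{\sum_{i=1}^{m} \breve{y}_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}\Big) \qquad (4.26)$$

for any element $(a^l, b^l)$ in a ceiling family

$$D_k^l = \{(a^l, b^l)\}_k = \{\arg\sup{}^l \{(a', b') \text{ s.t. } a' - b' \le k\}\} \qquad (4.27)$$

representing the top contour of $I_k^l$, where $\arg\sup^l$ means any pair $(a^l, b^l)$,
with $a^l - b^l = k$, of parameters of a line $\ell$ such that no line $y = a' + b'(x - \bar{x}) \in I_k^l$
lies even partially over $\ell$ when $x < \bar{x}$. The probabilistic
companion reads:

$$F_{I^l}(I_k^l) = P(I^l \subseteq I_k^l) = 1 - F_{S_E^\Sigma}\Big(\sum_{i=1}^{m} y_i - m a^l - m \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} + m b^l\Big) \qquad (4.28)$$

The latter cumulative distribution function in (4.28) still refers to the
variable $S_E^\Sigma$, thanks to independence and 0 symmetry hypotheses on the E
distribution.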


4.2.3 Pivot 


And for having the whole line in a confidence ...

We may read the last members of (4.24) and (4.28) as

$$F_{I^r}(I_k^r) = 1 - F_{S_E^\Sigma}(-m\Delta a^r - m\Delta b^r) \qquad (4.29)$$

$$F_{I^l}(I_k^l) = 1 - F_{S_E^\Sigma}(-m\Delta a^l + m\Delta b^l) \qquad (4.30)$$

which figures a distribution of shifts $\Delta A$ and $\Delta B$ around the estimates
$\hat{a} = \frac{1}{m}\sum_{i=1}^{m} y_i$ and $\hat{b} = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}$ of A and B respectively, as for parameters,
and of displacements around the corresponding line obtained by plugging these
estimates in the model (4.6) as for the entire line. These estimates are MLE
in the case that E follows a Gaussian distribution or one among a rich set of
symmetrical distributions containing it.


(a) (b)


Fig. 4.3: Two different strategies for gathering curve parameters into a confidence
region. (a): separately constraining the parameters; (b) jointly managing them to
avoid scarcely representative elements.


4.2.4 Nested Regions 


Equations (4.29) and (4.30) tell us that, with the mentioned pivot, a suitable
family $\mathcal{D}$ of nested regions comes from a joint binding of quantities $\Delta A$ and $\Delta B$
through proper quantiles exactly supplied by these formulas as a function of the
confidence level $\gamma$. However, to make the $\mathcal{D}$ elements definite we need another
relation on the pair of parameters (A, B) and we may decide to use either (4.16)
or (4.17). With current notation, they read:

$$F_A(\breve{a}) = 1 - F_{S_E}(-m\Delta a) \qquad (4.31)$$

$$F_B(\breve{b}) = 1 - F_{S_E^x}(-\Delta b\, S_{xx}) \qquad (4.32)$$

where $S_{xx} = \sum_{i=1}^{m} (x_i - \bar{x})^2$.

Denoting with $k_A$, $k_B$ and $k_\Gamma$ the absolute values of the symmetric range
extremes allowed to $\Delta A$, $\Delta B$ and their sum around 0, the confidence region is
determined by two of the following three equations:


take a center line 


and displace a 
region around it 


the shape of the 
confidence region 


the borders are 
very simple 




(a) (b) (c) 


Fig. 4.4: Confidence region shape: (a) upper bound, (b) lower bound and (c) both 
bounds of the confidence region for the straight line in the linear model (4.6). 


$$-k_A \le \Delta A \le k_A \qquad (4.33)$$
$$-k_B \le \Delta B \le k_B \qquad (4.34)$$
TuUe © kaor CLAR, (4.35) 


with r, being the analogous of the set I Hh " where its elements are bounded 
from the bottom by floor lines with a+ / — b > —kr. 

Using (4.33) and (4.34) you get the regions in Fig. 4.2 in force of A and B 
independence (whenever it holds), as in Fig. 4.3(a). To overcome the unsuitable 
spread of the confidence region we mentioned earlier about this solution, we 
work on the I u " distribution in conjunction with the one least spread of the two 
parameters (i.e. the one having the smallest variance). In this way, we directly 
control the range of this parameter and get the shortest conditional range of 
the other as it comes from the control on I u ". Using for instance (4.34) and 
(4.35), we look for the set of (a,b) pairs jointly satisfying these conditions (see 
Fig. 4.3(b)). This may figure the confidence region in Fig. 4.4(c) spanned by 
the lines with these parameters as the intersection of the upper bound region in 
Fig. 4.4(a) and the lower bound region in Fig. 4.4(b). As a matter of fact things 
are a bit more complex. We may achieve it and gain a very suitable shortcut 
for drawing the domain from these elementary geometrical considerations. 

Namely, if we use (4.33) and (4.35), 


• the intersection of ceiling lines $a + b(x - \bar{x})$ with $a + b = k_\Gamma$ occurs only
on the right of $\bar{x}$ in the point $(\bar{x} + 1, \hat{a} + \hat{b} + k_\Gamma)$⁵, while the intersection
of the ceiling curves with $a - b = k_\Gamma$ occurs only on the left of $\bar{x}$ in the
point $(\bar{x} - 1, \hat{a} - \hat{b} + k_\Gamma)$;


• we have analogously the two points $(\bar{x} + 1, \hat{a} + \hat{b} - k_\Gamma)$ and $(\bar{x} - 1, \hat{a} - \hat{b} - k_\Gamma)$
looking at the floor lines;


• the constraint $|\hat{a} - A| \le k_A$ says that no line may cross the vertical line
passing through $\bar{x}$ with a y outside the interval $(\hat{a} - k_A, \hat{a} + k_A)$.


⁵ Simply from equalities $y = a_1 + b_1(x - \bar{x}) = a_2 + b_2(x - \bar{x}) \Rightarrow a_1 - a_2 + (b_1 - b_2)(x - \bar{x}) = 0$.
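The geometry described by these three conditions can be reproduced numerically. The sketch below is an illustration under stated assumptions: the shifts $(\Delta a, \Delta b)$ are taken to satisfy $|\Delta a| \le k_A$, $|\Delta a + \Delta b| \le k_\Gamma$ and $|\Delta a - \Delta b| \le k_\Gamma$ with $k_A \le k_\Gamma$, and the function name `region_contour` is hypothetical. Since the ordinate of a line at a fixed x is linear in the shifts, its extremes are attained at the vertices of that polygon.

```python
def region_contour(a_hat, b_hat, x_bar, k_a, k_gamma, x):
    """Lower and upper ordinate reached at abscissa x by the lines
    y = (a_hat + da) + (b_hat + db) * (x - x_bar) whose shifts satisfy
    |da| <= k_a, |da + db| <= k_gamma, |da - db| <= k_gamma (k_a <= k_gamma)."""
    t = x - x_bar
    d = k_gamma - k_a
    # Vertices of the admissible polygon in the (da, db) plane.
    vertices = [(k_a, d), (k_a, -d), (-k_a, d), (-k_a, -d),
                (0.0, k_gamma), (0.0, -k_gamma)]
    shifts = [da + db * t for da, db in vertices]
    center = a_hat + b_hat * t
    return center + min(shifts), center + max(shifts)


# Illustrative numbers only: pivot line y = 5 + 2 (x - 2.75), k_A = 0.5, k_Gamma = 1.5.
low_mid, up_mid = region_contour(5.0, 2.0, 2.75, 0.5, 1.5, 2.75)      # at x_bar
low_right, up_right = region_contour(5.0, 2.0, 2.75, 0.5, 1.5, 3.75)  # at x_bar + 1
low_left, up_left = region_contour(5.0, 2.0, 2.75, 0.5, 1.5, 1.75)    # at x_bar - 1
```

At $\bar{x}$ the contour sits at $\hat{a} \pm k_A$, while at $\bar{x} \pm 1$ it passes through the four points listed in the bullets above.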


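Thus we have that the upper contour of the confidence region is identified by
the lines $l_i$ passing through the points (see Fig. 4.5(a)):

$$l_1 \to (\bar{x} - 1, \hat{a} - \hat{b} + k_\Gamma),\ (\bar{x} + 1, \hat{a} + \hat{b} - k_\Gamma) \qquad (4.36)$$
$$l_2 \to (\bar{x} - 1, \hat{a} - \hat{b} + k_\Gamma),\ (\bar{x}, \hat{a} + k_A) \qquad (4.37)$$
$$l_3 \to (\bar{x}, \hat{a} + k_A),\ (\bar{x} + 1, \hat{a} + \hat{b} + k_\Gamma) \qquad (4.38)$$
$$l_4 \to (\bar{x} - 1, \hat{a} - \hat{b} - k_\Gamma),\ (\bar{x} + 1, \hat{a} + \hat{b} + k_\Gamma) \qquad (4.39)$$

and the lower contour by

$$l_5 = l_4 \qquad (4.40)$$
$$l_6 \to (\bar{x} - 1, \hat{a} - \hat{b} - k_\Gamma),\ (\bar{x}, \hat{a} - k_A) \qquad (4.41)$$
$$l_7 \to (\bar{x}, \hat{a} - k_A),\ (\bar{x} + 1, \hat{a} + \hat{b} - k_\Gamma) \qquad (4.42)$$
$$l_8 = l_1 \qquad (4.43)$$

If we use (4.34) and (4.35), the control on B by first substitutes segments (4.38)
and (4.39) with the sole segment

$$l_9 \to (\bar{x} - 1, \hat{a} - \hat{b} + k_\Gamma),\ (\bar{x} + 1, \hat{a} + \hat{b} + k_\Gamma) \qquad (4.44)$$

and segments (4.42) and (4.43) with the sole segment

$$l_{10} \to (\bar{x} - 1, \hat{a} - \hat{b} - k_\Gamma),\ (\bar{x} + 1, \hat{a} + \hat{b} - k_\Gamma) \qquad (4.45)$$

These segments bind the regions in Fig. 4.4. They derive from the fact that:


• the constraint $|\hat{b} - B| \le k_B$ says that the highest (lowest) ordinate allowed
to a line when $-1 \le x - \bar{x} \le 1$ is $k_\Gamma$ ($-k_\Gamma$), that you get when $\Delta b^r = 0$ and
$\Delta a^r = k_\Gamma$ ($\Delta b^l = 0$ and $\Delta a^l = -k_\Gamma$).


The four extremal segments start from the same abscissas $\bar{x} \pm 1$ with angles
directly saturating the condition $|\hat{b} - B| \le k_B$.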

The values of $k_\Gamma$ and $k_A$ ($k_B$ if we limit B by the first) depend on the quantiles
associated to the variables A + B and A, and ultimately on the distribution
law of E as we will see in a moment. The scale adopted for x may draw the
intersection abscissas nearer or farther than $\pm 1$ from $\bar{x}$. In other words, we may
develop our theory by considering the distribution of $A + \rho B$ in general, getting
the same shape of the confidence region but a different distribution of the included
lines, privileging either the neck or the tails of the region (see Fig. 4.5(b)) for
the same confidence level.

In regard to quantiles, we are in the hypothesis that A and B are independent.
This makes the joint use of (4.33) and (4.34) easy. Still searching for
simplicity, we will consider E such that $\Delta A + \Delta B$ and $\Delta A - \Delta B$ are independent

only 8 edges 


satisfying four 
conditions on the 
line 


and one condition 
on the sole A 




(a) (b) 


Fig. 4.5: Contour of the confidence region for a regression line. Black piecewise lines:
upper and lower contour of the confidence region; gray lines: lines $l_i$ summing up the
contours; gray points identify the contours as in (4.36-4.43). $\sigma_A/\sigma_B < 1$; (a) $\rho = 1$ and
(b) $\rho = 0.567$.


variables. This lets us manage the intersection $I_k^r \cap I_k^l$. It is a condition that
occurs for instance when $\Delta A$ and $\Delta B$ are independent and Gaussian (which
holds for instance when the $E_i$'s are both independent and Gaussian). Finally, the
analysis of Fig. 4.5 suggests to us that:


e The part of the confidence region above the pivot line is determined by the 
line of I} and It, with AA > 0. Analogously the flooring with AA < 0. 
Hence the whole region is determined by the intersection of I ker NI; made 
up of lines with AA < k4 and D ir U I” pp with AA > —ka; 


e both events on A and events on re are independent (not variable A and 
A+B or B and A+ B, of course). 


Thus we may request that both $\Delta A + \Delta B$ and $\Delta A - \Delta B$ involved in (4.29)
and (4.30) fall in a range with probability $1 - \gamma/4$ each, so that $I_k^r \cap I_k^l$ is
bounded with probability $1 - \gamma/2$ from both the top and the bottom (for small
$\gamma$). An analogous condition on the single $\Delta A$ (or $\Delta B$) requires giving $\Delta A$
alternatively a range $(-\infty, k_A)$ or a range $(-k_A, +\infty)$ ($\Delta B$ a range $(-\infty, k_B)$
or $(-k_B, +\infty)$) with confidence $1 - \gamma/4$. By plugging the parameters satisfying
these conditions in the line equation (4.5), we finally define the $\mathcal{D}$ element of
index $\gamma$, i.e. the confidence region for the entire straight line with confidence level
$1 - \gamma$. In greater detail, adopting the usual tail symmetry strategy to determine
the intervals' extremes, and exploiting the assumed E symmetry around 0, we
arrive at the following equations:

$$1 - F_{S_E^\Sigma}(-m\Delta a^r_s - m\Delta b^r_s) = 1 - \gamma/8 \qquad (4.46)$$
$$1 - F_{S_E^\Sigma}(-m\Delta a^l_s + m\Delta b^l_s) = 1 - \gamma/8 \qquad (4.47)$$
$$1 - F_{S_E^\Sigma}(-m\Delta a^r_i - m\Delta b^r_i) = \gamma/8 \qquad (4.48)$$
$$1 - F_{S_E^\Sigma}(-m\Delta a^l_i + m\Delta b^l_i) = \gamma/8 \qquad (4.49)$$
$$1 - F_{S_E}(-m\Delta a^*_s) = 1 - \gamma/4 \qquad (4.50)$$
$$1 - F_{S_E}(-m\Delta a^*_i) = \gamma/4 \qquad (4.51)$$

where subscripts i and s denote respectively the inferior and superior extremes
of the interval, and $a^*$ stands for both $a^r$ and $a^l$ (or analogously
$1 - F_{S_E^x}(-m\Delta b_s) = 1 - \gamma/4$ and $1 - F_{S_E^x}(-m\Delta b_i) = \gamma/4$ in place of the last
two equations).


4.2.5 Confidence region for the dependent variable 


It is customary in the study of regression curves to identify a confidence region 


also for the single points (X,Y) of the suffix of the observed sample. 
they may be modeled for any fixed x through 


Y=A+B(a—-Z)+E 


we may set a’ = a + £ and repeat the previous passages, having 


CD 


((a' +b(x — T) <a" +b" (x — T)) Ys > T 


m 
m Dwele — 8) 
i=1 
a yim : m 

i=1 


with (a’",b") € Dg, with the final result 


Frp (I) = PU c If) = 


m 
= 1 — Fow X yi — ma" +m 
i=1 


Since 
(4.52) 
eo 
i > mi-z) 
<Y ntm (4.53) 
i=1 Si = T)? 
i=1 
yi (xi — T) 
El im | (4.54) 
(x; = z) 
i=1 


with SZ’ = Sl + mE. Analogous extensions are found in the region x < T. 
E E 8 8 


PV The pivot is the same as for the regression curves. 


NR Now, to shape a confidence region for the actual (X,Y), we have: 


The single points 
may drift by £ 


Similar relations 




e similar inequalities as (4.46) with the sole shifting of the underlying 
variable from S% to SW, namely: 


1- Fu (—mAall/" - mAb) = 1- 
(4.55) 
1— Fu (—mAall/" +mAb/") = 1- 


CO] col 


similar regions to get bounds on A +E, 
—(SE")1~y/4/m + |AB| < AA’ < (SE’)1-4/4/m—|AB| (4.56) 


once 


e bounds on A and B have been fixed from the confidence region for 
the line. 


a three-stage Thus we: i) gather A according to (4.34) ii) B according to (4.33) given 
procedure, A, and iii) A+E given A and B according to (4.56). Analogous procedure 
if we bind B by first. 


4.2.6 Implementing the procedure 


The actual implementation of this procedure passes through identification of 
the Sg distribution law. In the following we will see some examples complying 
with the loose assumptions made on this variable, addressing the reader to the 
subsequent section in the case of non compliance. 


Focusing on Gaussian drifts

Example 4.2. If shifts $E_i$ are distributed according to a Gaussian law of null
mean and known variance $\sigma^2$, the variables $S_E$, $S_E^x$, $S_E^\Sigma$ and $S_E^{\Sigma'}$ follow a
Gaussian distribution of null mean and variance $m\sigma^2$, $S_{xx}\sigma^2$, $\big(m + \frac{m^2}{S_{xx}}\big)\sigma^2$, and
$\big(m^2 + m + \frac{m^2}{S_{xx}}\big)\sigma^2$, respectively. Moreover, for a Gaussian variable E it is easy
to prove that $\Delta A + \Delta B$ and $\Delta A - \Delta B$ are independent. So, it is easy to
compute a $1 - \delta$ confidence interval for the regression line. Since the ratio
$\sigma_A/\sigma_B = \sqrt{S_{xx}/m}$ is less than 1, from (4.16) and (4.17), we decided
to bind $\Delta A$ singularly, so that the interval is described by the region spanned
by the sheaf of lines $y = \alpha + \beta(x - \bar{x})$, with

line confidence region

1 m 
oe Yi — A. <a<— ae Yi + 244 = (4.57) 
mE vm m 
Suey, 
= — Z1—y/40 ao o G= Ai << 
(zi -77 4=1 




S ulr- 7) 


= 1 1 
<p tery tg 
die: - 2)? 


i=l 


(4.58) 


1 m 
= 


where $z_\gamma$ is the $\gamma$ quantile associated to the Normal distribution, and the
confidence level $1 - \gamma/4$ refers to the joint satisfaction of left and right conditions
on $\beta$ got by the shift absolute values in (4.58).
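The variances quoted at the beginning of Example 4.2 can be double-checked by writing each statistic as a weighted sum $\sum_i w_i E_i$ of independent drifts, whose variance is $\sigma^2 \sum_i w_i^2$. The sketch below does this for the first three variables; the function name `drift_variances` and the numeric inputs are illustrative assumptions, and the weights of $S_E^\Sigma$ follow the reading $S_E^\Sigma = S_E + m S_E^x / S_{xx}$.

```python
def drift_variances(xs, sigma):
    """Variances of S_E, S_E^x and S_E^Sigma for independent drifts with
    standard deviation sigma, computed as sigma^2 * sum of squared weights."""
    m = len(xs)
    x_bar = sum(xs) / m
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w_plain = [1.0] * m                                   # weights of S_E
    w_x = [x - x_bar for x in xs]                         # weights of S_E^x
    w_sum = [1.0 + m * (x - x_bar) / s_xx for x in xs]    # weights of S_E^Sigma

    def variance(weights):
        return sigma ** 2 * sum(w * w for w in weights)

    return variance(w_plain), variance(w_x), variance(w_sum), s_xx


xs_demo = [1.0, 2.0, 3.0, 5.0]   # illustrative abscissas
sigma_demo = 2.0                 # illustrative standard deviation
var_s, var_sx, var_ssum, s_xx_demo = drift_variances(xs_demo, sigma_demo)
m_demo = len(xs_demo)
```

The three values agree with $m\sigma^2$, $S_{xx}\sigma^2$ and $(m + m^2/S_{xx})\sigma^2$; the cross term of the third one vanishes because $\sum_i (x_i - \bar{x}) = 0$.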

The region at the same confidence level for (X, Y) points is spanned by lines
$y = \alpha' + \beta(x - \bar{x})$, now with


X ulzi- 2B) 


tL 1 I 
Z p= Zj ,/l+—4 | —_____| <a’ < 
mou 1 7/47 aa Saa B m x S 


= (a; — 2)" 


Figure 4.6 shows the 0.900 confidence regions obtained for regression lines
and sampled points using those formulas. We used a sample of 20 observations
that we generated using model (4.6) with parameters $a = b = 15$ and $\sigma = 20$
(that we call reference values). The contours of these regions have been drawn
through the lines (4.36-4.43) and similar ones for the single points. We omitted
drawing the pivot of the regions, as it is obviously represented by the set of the
mean points of the confidence region contour lines. Figure 4.7 provides a
comparison, at different scales, between those regions and the corresponding ones
obtained using standard statistical theory [Sen and Srivastava, 1990] through
the formulas


points confidence 
region. 


While standard 
confidence 
regions read ... 


A region really 
spanned by the 
lines’ population 




Fig. 4.6: Confidence region for a regression line and related points. Bold line: reference
line $y = 15 + 15(x - \bar{x})$; bullets: sample points generated from the reference line, with
X uniform in [1,3] and Gaussian drifts with $\mu = 0$ and $\sigma = 20$. Dark shadow region:
0.900 confidence region for the regression line. Light shadow region: 0.900 confidence
region for random points.


m 2 Yili 
1 i=l 
= Yr 
(=l X (z: -T 
i=1 
eo oe = 
< yt = == 6a 20 TAE or 
m ; / m y 
g X (m: - 7) > (i - 7) 
i=1 =l 


In the figure we improperly compare confidence intervals referring to variables 
that come from different theoretical models yet the same operational framework. 
Indeed E[Y] stands for a fixed line a + b(x — T) originating any pair (X,Y) 
with the addition of noise E;. In any case we expect to find a straight line 
that explains our past and future observation pairs. Comparing the two pairs 
of regions we note that those coming from our inferential framework include 
the corresponding ones coming from the other approach. Is this broadening 
necessary? To check it we compute the coverage of the confidence region, as 
customary. Namely, we extend the numerical experiments in Fig. 4.6 by drawing 
a bootstrap population of 100 lines compatible with the employed statistics 




(a) (b) 


Fig. 4.7: Contrasting the confidence regions of Fig 4.6 with those deriving from the 
standard theory. Dark curves: standard 0.900 confidence region for the regression line. 
Light curves: standard 0.900 confidence region for points. (b) is a magnification of 
the left part of (a). 


$s_1 = \sum_{i=1}^{m} y_i$ and $s_2 = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}$. Here the seeds are constituted by samples
$\{\breve{\varepsilon}_i,\ i = 1, \dots, 20\}$ from a Gaussian variable of 0 mean and standard deviation
20, as specified for $E_i$; then, by inverting (4.7) and (4.8), respectively reading
$s_1 = m\breve{a} + \sum_{i=1}^{m} \breve{\varepsilon}_i$ and $s_2 = \breve{b} + \sum_{i=1}^{m} \breve{\varepsilon}_i (x_i - \bar{x}) / \sum_{i=1}^{m} (x_i - \bar{x})^2$, we compute
$\breve{a}$ and $\breve{b}$ for each sample. Figure 4.8 shows that the whole 0.900 confidence
region computed from these statistics is spanned by the lines and points of the
5000 samples. Only a small percentage, around 10%, falls outside. We avoided
plotting the points for the sake of figure readability.
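The bootstrap described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `bootstrap_lines`, the random seed and the way the reference sample is generated (X uniform in [1, 3], $a = b = 15$, $\sigma = 20$, $m = 20$, as in Fig. 4.6) are choices made here.

```python
import random


def bootstrap_lines(xs, ys, sigma, n_lines, seed=0):
    """Draw n_lines pairs (a, b) compatible with the observed statistics by
    inverting s1 = m a + sum(eps) and s2 = b + sum(eps (x - x_bar)) / S_xx."""
    rng = random.Random(seed)
    m = len(xs)
    x_bar = sum(xs) / m
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s1 = sum(ys)
    s2 = sum(y * (x - x_bar) for x, y in zip(xs, ys)) / s_xx
    lines = []
    for _ in range(n_lines):
        eps = [rng.gauss(0.0, sigma) for _ in range(m)]   # fresh noise seeds
        a = (s1 - sum(eps)) / m
        b = s2 - sum(e * (x - x_bar) for e, x in zip(eps, xs)) / s_xx
        lines.append((a, b))
    return lines, s1, s2


# Reference sample in the spirit of Fig. 4.6.
rng = random.Random(1)
xs = [rng.uniform(1.0, 3.0) for _ in range(20)]
x_bar = sum(xs) / len(xs)
ys = [15.0 + 15.0 * (x - x_bar) + rng.gauss(0.0, 20.0) for x in xs]

lines, s1, s2 = bootstrap_lines(xs, ys, sigma=20.0, n_lines=5000)
mean_a = sum(a for a, _ in lines) / len(lines)
mean_b = sum(b for _, b in lines) / len(lines)
```

The replicas scatter around the line with parameters $s_1/m$ and $s_2$; counting how many of them fall inside a candidate region gives the coverage check discussed in the text.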


Example 4.3. If the variance of the E;’s is unknown, to bind o we need as usual 


a further twisting argument. The relation 
(o <7) (Su -7< SOG - i) (4.60) 
i=1 i=1 


m 
pivots around statistic Soi — 7)? which is sufficient and minimal only when 
i=1 


coinciding with Soi — a — (x; — T))? for some a and b. This happens when 
i=1 


m m 
Sou =a’ and So ila: — T) = b' for some a’ and b’. Thus joint sufficient 
i=1 i=1 


If the drift | 
variance too is 
not fixed ... 




Fig. 4.8: Coverage of the algorithmic inference regions in Fig. 4.6 by 5000 bootstrap 
lines. Lines: bootstrapped populations; gray region: collation of lines falling in the 
confidence region. 


Fig. 4.9: Comparison between the line confidence region of Fig. 4.6 (gray region) and
the analogous region with unknown $\sigma$ (black region).


statistics must be pledged by two points (see Example 2.15 and Definition 2.8). 
In this case the random variable 


(4.61) 


follows a Student’s t distribution with m — 2 degrees of freedom. Therefore we 
must substitute the quantiles of this distribution to the quantile of the Normal 




distribution and 


formulas (4.57-4.59). Figure 4.9 shows the enlargement of the above regression
line confidence region due to the lack of knowledge about $\sigma^2$.


Example 4.4. If measurement errors E; are exponentially spread around the 
origin according to the distribution 


1 
fe: (E; A) = sae" 


whose shape is shown in Fig. 4.10, the distribution of the relevant variables 
have still a close though complex form. Namely, starting from SË, it figures 
as the sum of many variables like E;, but each having a different parameter 
A =A (1 + Ses k After some algebra according to Sec. B.5 we discover 
that its density has the following form 


fse A= kel (4.62) 


i=1 


where 


IDs 


ky = i e A 


i—1 m 


Hœ- IL o-a) 


j=1 j=i+1 


peanti (4.63) 


replace it with an 
estimate. 


We can manage 
other exponential 
drifts 


possibly through 
more complex 
distributions 




Fig. 4.11: Algorithmic confidence region for a regression line and points scattering
with double exponential drifts with $\lambda = 0.0513$ around the line. Sample generated
from a reference line $y = 5 + 5(x - \bar{x})$, with X uniform in [4,6]. Same notation as in
Fig. 4.7.


The variable $S_E^{\Sigma'}$ follows the same distribution law, now referred to m + 1 addends,
where the last has parameter exactly $\lambda/m$. In principle $S_E$ follows the
same law, but we need to consider the limiting distribution for equal $\lambda$'s, which
nullify the denominator in (4.63). This leads to the following distribution


law: i 
; A) = ———— re?! 4.64 
fse(@ 9) = qrr en Ae) (4.64) 
where the functional coefficient c,, is defined through the following recurrent 

relations: 

ale) = cG_1(x)(2(i—1)—1) +27 e_2(x) Vi>3 (4.65) 
c(t) = «+1 (4.66) 
eilg) = 1 (4.67) 


possibly preserving some crucial independences.

Then the procedure works in perfect analogy with what has been done in the
previous example (see Fig. 4.11). In particular, since the moment generating
functions of the Normal distribution and of the standardized symmetric exponential one
almost coincide around the zero of their argument, the independence condition
between $\Delta a + \Delta b$ and $\Delta a - \Delta b$ is approximately satisfied also in this case.


4.2.7 The multidimensional case 


This procedure may be easily extended to n independent variables, i.e.
$X \in \mathfrak{X}^n \subseteq \mathbb{R}^n$. In this case, denoting with $x_{ji}$ the j-th component of the i-th sample
vector, the model is:


Yi =a + bı (£ii — T1) +... + bn (£ni — Tn) + £i, E= E 24) (4.68) 




with analogous inequalities between coefficients of the hyperplane mediating
(4.68) over $\epsilon$ as in (4.5) and a hyperplane $y = a' + b'_1 (x_1 - \bar{x}_1) + \ldots + b'_n (x_n - \bar{x}_n)$
to ensure the dominance of the latter, depending on the $\mathbb{R}^n$ orthant we are
considering. Namely, denoting by Ort a generic orthant with vertex in $\bar{x}$, i.e. the
set of x with components $x_i$ in a pre-assigned half-line on the left or right of $\bar{x}_i$,
and denoting by $\mathrm{sign}_{Ort}(x_i)$ a variable assuming value +1 if we are considering
the half-line $(\bar{x}_i, +\infty)$ and -1 if we are considering the complementary half-line
$(-\infty, \bar{x}_i]$, the general condition is:


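$$(y' \ge y \ \ \forall x \in Ort) \Leftrightarrow \left(a' \ge a \wedge \left(\mathrm{sign}_{Ort}(x_i)\, b'_i \ge \mathrm{sign}_{Ort}(x_i)\, b_i \ \ \forall i\right)\right) \qquad (4.69)$$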
Hence we have analogous c.d.f. 
Fron (I2") = 1- Feu (—mAa—signgy(#1)mAby—...—signg,(an)mAbn) (4.70) 


mw 


where SE” is a Gaussian variable with mean 0 and variance 0%(1+ 3°). i) 


J=1 Sa52; 
in the orthogonality hypothesis that all sums $\sum_{i=1}^m (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k)$ are zero
for any $j \in \{1, \ldots, n\}$ and $k \in \{1, \ldots, n\}$ with $j \ne k$. Hence we may identify
the confidence region by spanning the $\mathfrak{X}^n \times \mathfrak{Y}$ space with hyperplanes satisfying
the conditions:


—( E )i—y/anta/m + |AA| <AB,+...+ AB, < (SE): —y/2n41/m = |AA.71) 
—Za_ja/™M < AA < 2~7/4/1A.72) 


where the divisor of $\gamma$ is due to the fact that we must jointly satisfy conditions
in each of the $2^n$ orthants. Actually, to get a tight region it is convenient to
span the hyperplane coefficients moving from the one with the smallest variance
to the one with the greatest variance. If the orthonormality hypothesis does not
hold, we may always compute a rotation of $\mathfrak{X}^n$ exhibiting this property on the
sample data, compute the confidence region in this framework and rotate back
$\mathfrak{X}^n$. Following this procedure, in Fig. 4.12 we report the confidence region of a
regression plane where the dependent variable is modeled as a linear function
of two independent variables. Also in this case the contours of the confidence
regions have a closed analytical form that we may find in strict analogy to
(4.36-4.43).
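The rotation step mentioned above can be sketched numerically. The following is only an illustration under our own assumptions: it uses the eigenvectors of the centered scatter matrix as the orthogonal change of coordinates (which may include a reflection), and the function name and the synthetic data are hypothetical, not taken from the text.

```python
import numpy as np

def orthogonalizing_rotation(x):
    """Return an orthogonal matrix and the centered sample expressed in the new
    coordinates, where all cross sums sum_i z_ji * z_ki (j != k) vanish."""
    xc = x - x.mean(axis=0)            # center each independent variable
    scatter = xc.T @ xc                # sums of (x_ji - mean_j)(x_ki - mean_k)
    _, rot = np.linalg.eigh(scatter)   # orthonormal eigenvectors (symmetric matrix)
    return rot, xc @ rot

# Hypothetical correlated sample: 50 points, 3 independent variables.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
x = rng.normal(size=(50, 3)) @ mix

rot, z = orthogonalizing_rotation(x)
cross = z.T @ z
off_diag = cross - np.diag(np.diag(cross))   # should be numerically zero
x_back = z @ rot.T                           # "rotate back" to the original axes
```

Any region computed in the z coordinates can be mapped back with `rot.T`, since `rot` is orthogonal.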


4.3 Moving to non linear functions 
Let’s return to the initial model 

Zm = {(@i, Yi), Vi € Xy; = clrc h i= 1,...,m} C (x Y)™ (4.73) 
with $c \in \mathcal{C}$. We will apply our procedure also in this general case. Here below
we consider two case studies where the regression curve is non linear and also
the dependence on noise terms is far from the typical linear model (4.5).




Fig. 4.12: 0.900 confidence regions for a regression plane assuming a Gaussian noise.
Landscapes: confidence region bounds; plane: ML estimate. Points: sample
extracted from SMSA dataset [pollution and mortality, 2005], where $x_1$ refers to
demographic characteristics, $x_2$ to the concentration of sulfur dioxide, and y to the age
adjusted mortality. (a) The surfaces are identified by their contour edges; (b)
magnification of the central part of (a) using filled surfaces.


4.3.1 Confidence intervals for the hazard function of survival data


Although it has some peculiarities that set it in a somewhat intermediate
position between linear and non linear models, the first case study we will show
offers the twofold benefit of a relatively simple formalization and a high
operational interest. Namely, we will address the problem of inferring the hazard
function from survival data.


Definition 4.2. Being $f_T$ the d.f. of T and $F_T$ its c.d.f., the hazard function of
T is defined as

$$h(t) = \frac{f_T(t)}{1 - F_T(t)} \qquad (4.74)$$
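As a quick numerical reading of Definition 4.2, the sketch below evaluates the ratio in (4.74) for a negative exponential variable, for which the hazard is known to be constant. The helper name `hazard` and the value of the rate are our own illustrative choices.

```python
import math

def hazard(pdf, cdf, t):
    """Hazard function h(t) = f_T(t) / (1 - F_T(t)) as in (4.74)."""
    return pdf(t) / (1.0 - cdf(t))

lam = 0.5  # hypothetical rate of a negative exponential variable

def exp_pdf(t):
    return lam * math.exp(-lam * t)

def exp_cdf(t):
    return 1.0 - math.exp(-lam * t)

# The hazard of a negative exponential variable does not depend on t.
values = [hazard(exp_pdf, exp_cdf, t) for t in (0.1, 1.0, 5.0)]
```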


In words, this function may be interpreted as the probability density that
an event occurs exactly at time t given that it did not occur before t. For the
negative exponential variable T (see page 343 in Appendix B) we have that h(t)
is a constant equal to $\lambda$. In many operational frameworks however, e.g. plant
reliability or health care, h is a true function of t taking values on the same
support $\mathfrak{T}$ of the random variable T that we want to explain through h. Thus
the reference model may be the following:


Zm = {(ti,7:), 71 € 2,7) = g(h(%), €i),4 =1,...,m} C (Rx T)™ (4.75) 


with g as a part of T's explaining function. Hence h(t) is a curve in t to
be inferred from the $t_i$ produced by $g(h(t_i), \epsilon_i)$ in (4.75), with a mechanism
analogous to the production of $y_i = a + b x_i + \epsilon_i$ in a linear regression model.
The differences consist in:


• $t_i$ is a fixed point in the recursive equation $t_i = g(h(t_i), \epsilon_i)$;
• the random part of $t_i$ does not add linearly to the deterministic one.


We confine the random part of $t_i$ directly on the uniform variable in the
sampling mechanism. In particular we look for a regression model of the form:

$$\ln t_i = \ln \beta(t_i) + \ln(-\ln u_i) \qquad (4.76)$$

with $1/\beta(t)$ playing the role of $\lambda(t)$ in a nonhomogeneous negative exponential
distribution law described by a c.d.f. like:

$$F_T(t) = 1 - e^{-t/\beta(t)} \qquad (4.77)$$

(compare (4.76) with (2.3)). The non linearity of the problem is questionable.
Form (4.76) may look like an additive model, since we add $\ln(-\ln u_i)$ to $\ln \beta(t_i)$
to obtain $\ln t_i$. But $t_i$ is a fixed point into a possibly nonlinear function of $t_i$,
and this renders the t dependency on $\ln(-\ln u)$ more complex than linear.

In particular we will consider the following two cases:


1. D1
$$\beta(t) = \beta_0 \beta_1^{-\ln t}, \quad \beta_0 > 0, \ \beta_1 > e^{-1} \qquad (4.78)$$
with c.d.f.
$$F_T(t) = 1 - e^{-t \beta_1^{\ln t} / \beta_0} \qquad (4.79)$$
sampling mechanism
$$t = \left(\tau : \tau \beta_1^{\ln \tau} = \beta_0 (-\ln u)\right) \qquad (4.80)$$
and hazard function
$$h(t) = \frac{1}{\beta_0} \beta_1^{\ln t} (1 + \ln \beta_1) \qquad (4.81)$$

2. D2
$$\beta(t) = \frac{1}{a} t^{1-b} \qquad (4.82)$$
with c.d.f.
$$F_T(t) = 1 - e^{-a t^b} \qquad (4.83)$$
sampling mechanism
$$t = \left(-\frac{1}{a} \ln u\right)^{1/b} \qquad (4.84)$$
and hazard function
$$h(t) = a b t^{b-1} \qquad (4.85)$$


It is easy to recognize that D1 and D2 are different representations of the
same Weibull model [Juran and Gryna, 1988]. You pass from the former to the
latter just by stating $a = 1/\beta_0$ and $b = 1 + \ln \beta_1$. Moreover, since with D2
representation we have

$$L(t_1, \ldots, t_m | a, b) = a^m b^m \prod_{i=1}^m t_i^{b-1} \exp\left(-a \sum_{i=1}^m t_i^b\right) \qquad (4.86)$$

we have no synthetic jointly sufficient statistics. Hence we must rely on well
done statistics that we try to discover within the two representations. We will
pass from a quasi-analytical computation of the joint parameters' distribution,
allowing us to draw curves' confidence regions with great approximation in D1,
to a brute force bootstrap procedure for determining the same regions in D2.
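The equivalence of the two representations can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the D1 hazard as $h(t) = \beta_1^{\ln t}(1 + \ln\beta_1)/\beta_0$ and the D1 sampling mechanism as the solution of $\tau\beta_1^{\ln\tau} = \beta_0(-\ln u)$, which is our reading of (4.80)-(4.81) consistent with the map $a = 1/\beta_0$, $b = 1 + \ln\beta_1$; the parameter values and function names are hypothetical.

```python
import numpy as np

def d1_to_d2(beta0, beta1):
    """Map D1 parameters to D2 parameters: a = 1/beta0, b = 1 + ln(beta1)."""
    return 1.0 / beta0, 1.0 + np.log(beta1)

def hazard_d1(t, beta0, beta1):
    # Assumed D1 form of the hazard function.
    return beta1 ** np.log(t) * (1.0 + np.log(beta1)) / beta0

def hazard_d2(t, a, b):
    # Weibull hazard in the D2 form h(t) = a b t^(b-1).
    return a * b * t ** (b - 1.0)

def sample_d2(u, a, b):
    # D2 sampling mechanism t = (-(1/a) ln u)^(1/b).
    return (-np.log(u) / a) ** (1.0 / b)

beta0, beta1 = 4.0, 1.5            # hypothetical D1 parameters
a, b = d1_to_d2(beta0, beta1)

t_grid = np.linspace(0.1, 20.0, 50)
h1 = hazard_d1(t_grid, beta0, beta1)
h2 = hazard_d2(t_grid, a, b)

# Samples drawn through D2 also satisfy the D1 fixed-point equation.
rng = np.random.default_rng(1)
u = rng.uniform(size=1000)
ts = sample_d2(u, a, b)
lhs = ts * beta1 ** np.log(ts)
rhs = beta0 * (-np.log(u))
```

The explicit D2 inversion is what makes sampling cheap; in D1 the same draw is only defined implicitly as a fixed point.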


4.3.2 D1 representation 


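CD We focus on the statistics:

$$s_1 = \sum_{i=1}^m \left(\ln t_i - \overline{\ln t}\right)^2 = \frac{\sum_{i=1}^m \left(g(u_i) - \overline{g(u)}\right)^2}{(1 + \ln \beta_1)^2} \qquad (4.87)$$

and

$$s_0 = \overline{\ln t} = \frac{\ln \beta_0 + \overline{g(u)}}{1 + \ln \beta_1} \qquad (4.88)$$

where

$$g(u) = \ln(-\ln u) \qquad (4.89)$$
$$\overline{g(u)} = \frac{1}{m} \sum_{i=1}^m g(u_i) \qquad (4.90)$$
$$\overline{\ln t} = \frac{1}{m} \sum_{i=1}^m \ln t_i \qquad (4.91)$$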
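The two readings of each statistic, on the sample side and on the seed side, can be compared numerically. This is a minimal sketch assuming that $s_1$ is the unnormalized sum of squared deviations of $\ln t_i$ and that $\ln t_i = (\ln\beta_0 + g(u_i))/(1 + \ln\beta_1)$; parameter values and function names are hypothetical.

```python
import numpy as np

def g(u):
    """Seed transform g(u) = ln(-ln u) as in (4.89)."""
    return np.log(-np.log(u))

def d1_statistics(t):
    """Return (s0, s1): mean of ln t and sum of squared deviations of ln t."""
    lt = np.log(t)
    return lt.mean(), np.sum((lt - lt.mean()) ** 2)

beta0, beta1 = 4.0, 1.5                       # hypothetical parameters
b = 1.0 + np.log(beta1)

rng = np.random.default_rng(2)
u = rng.uniform(size=100)                     # seeds
t = (-beta0 * np.log(u)) ** (1.0 / b)         # Weibull sample built from the seeds

s0, s1 = d1_statistics(t)                     # sample side
gu = g(u)
s0_seed = (np.log(beta0) + gu.mean()) / b     # seed side of (4.88)
s1_seed = np.sum((gu - gu.mean()) ** 2) / b ** 2   # seed side of (4.87)
```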


Equations (4.87) and (4.88) show that $s_0$ and $s_1$ are well defined for each
$\beta_0$ and $\beta_1$; moreover $s_1$ is monotone in $\beta_1$ and, for fixed $\beta_1$, also $s_0$ is
monotone in $\beta_0$. Hence they suggest the following logical implications:

$$(\beta_1 \le \tilde\beta_1) \Leftrightarrow (s_1 \ge s_{\tilde\beta_1}) \quad \text{independently of } \beta_0 \qquad (4.92)$$
$$(\beta_0 \le \tilde\beta_0) \Leftrightarrow (s_0 \le s_{\tilde\beta_0}) \quad \text{for fixed } \beta_1 \qquad (4.93)$$

where $s_{\tilde\beta_1}$ denotes the value assumed by $s_1$ after a shift of $\beta_1$ to $\tilde\beta_1$, and
analogously for $s_{\tilde\beta_0}$.
Now, the second implication is conditioned to the value of the parameter
involved in the former. This means that we may adopt an order relation
on the parameter $\beta = (\beta_0, \beta_1)$ whose contour lines, in the plane having
its components as coordinate axes, are represented by angles constituting
the left-down quadrant of crossing lines that are parallel to the axes,
namely with edges $\{\beta_0 = \tilde\beta_0, \beta_1 \le \tilde\beta_1\}$ and $\{\beta_0 \le \tilde\beta_0, \beta_1 = \tilde\beta_1\}$ for any
$(\tilde\beta_0, \tilde\beta_1)$. We compute the joint probability distribution of the two
parameters through the chain rule induced by (4.92) and (4.93). Equation
(4.87) shows that the $S_1$ distribution depends only on $\beta_1$, but we have no
analytical form of $F_{S_1}$. Hence we use an intermediate procedure between
a purely analytical and an entirely numerical one. Equation (4.87) indeed
shows that the whole boils down to generating once the distribution law of
$\sum_{i=1}^m \left(g(U_i) - \overline{g(U)}\right)^2$. We may assume this statistic representing $S_1$
in the case $\beta_1 = 1$. On the basis of n (say 10,000) samples of m
specifications of U each we compute the $S_1$ empirical c.d.f., which we denote $F^1$,
as template and obtain analogous functions $F^{\beta_1}$ with other values of this
parameter just by rescaling the F argument by a factor $(1 + \ln \beta_1)^2$. Hence,
from the probability companion of (4.92)

$$F_{B_1}(\tilde\beta_1) = F_{S_1^{\tilde\beta_1}}(s_1) \qquad (4.94)$$

we get the $\gamma$-quantile of $\beta_1$, denoted $(\beta_1)_\gamma$, as the curve parameter of the
$F^{\beta_1}$ crossing the point $(s_1, \gamma)$ in the plane $(x, F^{\beta_1}(x))$. This simply means
that $(1 + \ln(\beta_1)_\gamma)^2$ equals the ratio between the $\gamma$-quantile of $F^1$ and $s_1$.
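The template-and-rescaling step lends itself to a short numerical sketch. The code below is only an illustration under our own assumptions: the template is the unnormalized sum of squared deviations of $g(U_i)$, the positive root $1 + \ln\beta_1 > 0$ is taken, and the observed value `s1_obs` is a made-up number.

```python
import numpy as np

def g(u):
    return np.log(-np.log(u))

def template_s1(m, n_rep, rng):
    """Bootstrap replicas of sum_i (g(U_i) - mean g(U))^2, i.e. S_1 when beta1 = 1."""
    gu = g(rng.uniform(size=(n_rep, m)))
    return np.sum((gu - gu.mean(axis=1, keepdims=True)) ** 2, axis=1)

def beta1_quantile(s1, gamma, template):
    """gamma-quantile of beta1 from (1 + ln beta1)^2 = q_gamma(template) / s1."""
    q = np.quantile(template, gamma)
    return np.exp(np.sqrt(q / s1) - 1.0)

rng = np.random.default_rng(3)
template = template_s1(m=100, n_rep=10_000, rng=rng)

s1_obs = 80.0                                   # hypothetical observed statistic
lo = beta1_quantile(s1_obs, 0.025, template)
hi = beta1_quantile(s1_obs, 0.975, template)

# Consistency check: the rescaled template evaluated at s1_obs returns gamma.
frac = np.mean(template / (1.0 + np.log(lo)) ** 2 <= s1_obs)
```

The template is generated once; each new quantile level costs only one lookup and one rescaling.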


We cannot do the same for $\beta_0$ since we cannot express the distribution
of $B_0|\beta_1$ as a function of g(U) that is independent of the distribution
of $B_1$, as the latter is a function of the variable $\sum_{i=1}^m \left(g(U_i) - \overline{g(U)}\right)^2$.
Hence we directly find this distribution by bootstrapping populations that
are compatible with the statistics $s_0$ and $s_1$ for a fixed $\beta_1$. Namely we
compute the conditional distribution of $B_0|\beta_1$ by applying (2.8) to the
parameters of these populations for a suitable discretization of $\beta_1$. The
oblique edges of the parallelogram in Fig. 4.13(c) are a linear interpolation
of the $1 - \delta_0$ and $\delta_0$ quantiles of $\log B_0|\beta_1$. From (4.88) we see that,
beside the mentioned correlation between the functions of U, the $\log B_0|\beta_1$
conditioning parameter simply shifts the $B_0$ distribution by a factor
$(1 + \ln \beta_1)$. Actually, the interpolation almost overlaps the original data, and
the slopes of the two edges shift from the above factor by a small quantity
accounting for the U functions' correlation.


NR We get a domain $\mathcal{B}$ representing a joint confidence region for the
parameters $\beta_0$ and $\beta_1$ by spanning the plane $(\beta_0, \beta_1)$ with segments delimited by
the pair of quantiles $(\beta_0|\beta_1)_{\delta_0/2}$ and $(\beta_0|\beta_1)_{1-\delta_0/2}$ from the variable $B_0|\beta_1$
for each $\beta_1$, and ranging $\beta_1$ between the quantiles $(\beta_1)_{\delta_1/2}$ and $(\beta_1)_{1-\delta_1/2}$
(see Fig. 4.13(c) with $\delta_0 = 0.05$ and $\delta_1 = 0.05$). Plugging points from $\mathcal{B}$
into (4.81) we obtain an approximation of the confidence region $\mathcal{D}$ (see
Fig. 4.13(a)) as in Definition 4.1 for the random function H(t). Indeed we
have the implication

$$((\beta_0, \beta_1) \in \mathcal{B}) \Rightarrow (h(t) \in \mathcal{D} \ \ \forall t) \qquad (4.95)$$

which minorizes the confidence $1 - \gamma_{\mathcal{D}}$ of $\mathcal{D}$ through

$$1 - \gamma_{\mathcal{D}} \ge (1 - \delta_0)(1 - \delta_1) \qquad (4.96)$$

The inverse implication fails since hazard functions lying in the above
region may not satisfy the mentioned bounds on the parameters, pushing
$\beta_0, \beta_1$ out of $\mathcal{B}$. This denotes a lack of bijectivity, for short, as a
consequence of non sufficiency of the used statistics.


Hence, if (conservatively) approximate equality in (4.96) is not satisfactory
we move to the bootstrap population of our regression curves. Namely,
we solve (4.87) and (4.88) in $\beta_0$ and $\beta_1$ for seeds $\{u^*_1, \ldots, u^*_m\}$, then we
plug the pairs into equation (4.81) to get specifications of the hazard
function. With thousands of these curves we span the (t, h(t)) plane in a way
analogous to what we did before. The difference lies in the ranges of the
spanning parameters: fixed within quantiles there, arising from simulation
here. Hence the plane coverage we obtain is representative of the
whole H(t) distribution, and we obtain a confidence region by peeling
this set. Namely, the continuity hypothesis on the parameters translates
into a connectivity hypothesis [Falconer, 1960] on the region spanned by
h(t), now not prevented by constraints on parameters so that every curve
h(t) in D1 falling entirely in the region is reckoned in the computation
of the region confidence. This happens because it is possible to prove
that every regression curve compliant with model (4.81) may be drawn by
bootstrapping the parameters, hence without suffering the mentioned lack
of bijectivity. Thus a sample obtained with this method is representative
of the whole curves population. Consequently, our method of peeling the
set of curves to obtain a confidence region (a way that we call peeling
method) consists of pruning the bordering ones. In particular, the linearity
of $\ln h(t)$ figures each curve enhancing shifts from the core in at least
one of the two extremes $\ln t \ll 0$, $\ln t \gg 0$. Hence we prune the sheaf of
lines in the logarithmic plane by around 9.75% of its elements by circularly

visiting the four corners of the $(\ln t, \ln h(t))$ picture, with 107° < t < 60
(see Fig. 4.13(d)), and removing lines containing contouring traits in order
to obtain a simply connected 0.9025 confidence region (peeling method).
Obviously the accuracy with which we meet the confidence increases with
the number of bootstrapped curves; therefore, whenever this is the case, we
speak of an asymptotically exact confidence region. Fig. 4.13(b)
shows the shape of the obtained confidence region in the plane (t, h(t)), to
be compared with Fig. 4.13(a) (as $0.9025 = 0.9500^2$), with some caveats
that will be discussed later on with the numerical experiments.


PV We simply rely on the ML parameters that we plug into (4.81) to get an MLE of
the curve. In spite of some visual off-centering of the MLE curve within
the confidence region, the fraction of points belonging to the bootstrap
curve population and falling below the MLE curve is 0.474, with a shift
from 0.5 explained by the asymmetry of the curve distribution.
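The peeling idea described above can be illustrated with a deliberately simplified sketch. It is not the corner-visiting procedure of the text: here, as an assumption of ours, curves are sampled on a grid in the logarithmic plane and we alternately drop the curve forming the largest part of the upper or of the lower envelope until the requested fraction survives. The bootstrap lines are synthetic and all names are hypothetical.

```python
import numpy as np

def peel(curves, keep_fraction):
    """Simplified peeling: curves has shape (n_curves, n_grid); returns the
    indices of the curves kept after pruning the bordering ones."""
    alive = np.arange(curves.shape[0])
    n_keep = int(round(keep_fraction * curves.shape[0]))
    take_upper = True
    while alive.size > n_keep:
        sub = curves[alive]
        # which surviving curve attains the envelope at each grid point
        extreme = sub.argmax(axis=0) if take_upper else sub.argmin(axis=0)
        # drop the curve contributing the largest share of that envelope
        worst = np.bincount(extreme, minlength=alive.size).argmax()
        alive = np.delete(alive, worst)
        take_upper = not take_upper
    return alive

# Synthetic sheaf of lines ln h = intercept + slope * ln t (hypothetical values).
rng = np.random.default_rng(4)
log_t = np.linspace(np.log(1e-3), np.log(60.0), 40)
intercept = rng.normal(-2.0, 0.3, size=400)
slope = rng.normal(0.4, 0.1, size=400)
curves = intercept[:, None] + slope[:, None] * log_t[None, :]

kept = peel(curves, keep_fraction=0.9025)
lower = curves[kept].min(axis=0)   # contour of the peeled region
upper = curves[kept].max(axis=0)
```

As the text notes for the real procedure, the resulting region depends on the grid range and on the order in which curves are removed.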


4.3.3 D2 representation 


CD 


PV 


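Starting from D2 representation we identified the following statistics:

$$s'_0 = \frac{\sum_{i=1}^m (\ln t_i)^2}{\left(\sum_{i=1}^m \ln t_i\right)^2} = \frac{\sum_{i=1}^m \left(g(u_i) + \ln \frac{1}{a}\right)^2}{\left(\sum_{i=1}^m \left(g(u_i) + \ln \frac{1}{a}\right)\right)^2} \qquad (4.97)$$

$$s'_1 = \sum_{i=1}^m \ln \frac{t_i}{t_{\min}} = \frac{1}{b} \sum_{i=1}^m \left(g(u_i) - g(u_{\max})\right) \qquad (4.98)$$

where $t_{\min}$ and $u_{\max}$ denote the minimum and maximum values taken
respectively by $t_i$ and $u_i$. The latter statistic has a monotone course with
b. Instead $s'_0$ has an asymptote when $\sum_{i=1}^m \ln t_i = 0$, a condition that
occurs with a near equal $e^{-C_E}$, where $C_E$ is the Euler-Mascheroni constant
[Mascheroni, 1790] whose value $\approx 0.577$ equals $-E(g(U))$. It increases with
a before this value, decreases after it.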
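The link between the seed transform and the Euler-Mascheroni constant can be checked by simulation. This is a small Monte Carlo sketch under our own choice of sample size and seed; it only uses the standard facts that $g(u) = \ln(-\ln u)$ and that NumPy exposes the constant as `np.euler_gamma`.

```python
import numpy as np

rng = np.random.default_rng(5)
u = rng.uniform(size=1_000_000)

# Monte Carlo estimate of -E[g(U)] with g(u) = ln(-ln u)
mc_estimate = -np.mean(np.log(-np.log(u)))

# Approximate location in a of the asymptote of the first D2 statistic
a_asymptote = np.exp(-np.euler_gamma)
```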


Actually, when (2.5) has more than one solution in $\theta$ we do not know how
to distribute the $(u^*_1, \ldots, u^*_m)$ probability density on them, which makes
the bootstrap method indeterminate, and actually motivates the general
monotonicity requests. The asymptote we meet in our case however
removes this drawback. We must just decide whether to locate ourselves before
or after it, since the two solutions we find for each $(u^*_1, \ldots, u^*_m)$ are
incompatible. Indeed, denoting by $\bar{a}$ the asymptote abscissa and by $f_A$ the
probability density of A, both $\int_{-\infty}^{\bar{a}} f_A(a)\,da$ and $\int_{\bar{a}}^{\infty} f_A(a)\,da$ singularly
equal 1.


We decide to select the part where the MLE of A falls, expecting a tight 
relation between this statistic and the mode of A, even if we cannot prove 




Fig. 4.13: 0.9025 confidence regions of Weibull model hazard function: (a) confidence
region via plug-in method; (b) confidence region via peeling method; lines:
bootstrapped hazard functions, black lines: pruned curves, gray region: collation of
remaining lines representing the confidence region, dark gray line: hazard function
MLE; (c) parameters' range at the basis of (a); (d) logarithmic scale representation of (b).




Fig. 4.14: Confidence regions as in fig. 4.13 based on less suitable statistics: (a) and 
(b): the analogous graphs of fig. 4.13(a) and fig. 4.13(b). 


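it to be a sufficient statistic. Hence we pivot our sample/population
properties on $S'_0$ as well, according to the following implications:

$$(a \le \tilde{a}) \Leftrightarrow (s'_0 \le\!/\!\ge s'_{\tilde{a}}) \qquad (4.99)$$
$$(b \le \tilde{b}) \Leftrightarrow (s'_1 \ge s'_{\tilde{b}}) \qquad (4.100)$$

where $\le\!/\!\ge$ moves to either $\le$ or $\ge$ as said before.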


NR Once the versant of the a asymptote has been selected, we may draw a
confidence region for the parameters in a way analogous to what we did for
D1. For instance we may first find a confidence interval for $\ln B$,
given its linear relation with $\ln s'_1$ and $\ln \sum_{i=1}^m (g(u_i) - g(u_{\max}))$. Then
we numerically evaluate the conditioned quantiles of $\ln A|b$ and repeat
the steps with D1, getting definitely worse confidence regions. We get a
greatly broader region by the mapping from (a, b) to (t, h(t)) as a
consequence of the facts that: i) statistic $s'_0$ lacks monotonicity, and ii) statistic $s'_1$ is
based on the extremal value $t_{\min}$ whereas T is unbounded, so that they
are both non well behaving statistics. As we may see in Fig. 4.14(a), these
drawbacks are magnified by the mentioned lack of bijectivity, but their
consequences do not disappear even if we work directly with bootstrap
populations of the regression curves with analogous procedure as with D1
representation (Fig. 4.14(b)).


4.3.4 Checking the method 


Example 4.5. Consider the following reconstruction problem. We start from
a distribution law within the Weibull model, namely fixing its parameters in
order to have mean = 8.67 and standard deviation = 6.47, draw a sample of 100




Fig. 4.15: Hazard function confidence regions in a reconstruction task. (a) r-chart.
Thin black curves: spanning with discretized parameters within their 0.9500 confidence
intervals; bold black curves: contour of approximately 0.9025 confidence region by
plug-in method; thin gray lines: 0.9025 coverage; dashed bold black curves: contour
of (asymptotically) exact 0.9025 confidence region by peeling method. Bold gray line:
original hazard function. (b) $\pi$-chart. Points: bootstrap parameters consistent with
the sample statistics; gray points: parameters contained in the peeling confidence
regions; black points: parameters external to the peeling confidence region. Dashed
contours: peeling confidence region; plain contours: plug-in confidence region.


specifications of T, and aim to compute the $1 - \gamma = 0.9025$ confidence regions as
in Fig. 4.15. Using statistics $s_0$ and $s_1$ we draw the confidence regions with both
plug-in and peeling methods, setting to 5,000 the number of bootstrap replicas
for both computing parameters' distributions and appraising the coverage of the
confidence regions. As customary, our questions concern the possible
populations and related regression curves having a sample with the actually observed
statistics as a prefix. Although our final goal is to draw a confidence region for
the hazard function as in Definition 4.1, we will visualize the results on both
the parameters $(\beta_0, \beta_1)$ space as in Fig. 4.15(b), which we denote as parameter
chart ($\pi$-chart, for short), and in the (t, h(t)) space as in Fig. 4.15(a), which
we denote as regression chart (r-chart). The principal tool for appreciating the
accuracy of the confidence regions is their coverage in the r-chart. While it
matches the confidence level with the peeling method, by definition, its study
in the $\pi$-chart clarifies some weakness of the plug-in method as a counterpart
to its greater computational efficiency in some cases.


• First of all we remark that the original hazard functions lie in a barycentric
position within the confidence regions, with a displacement w.r.t. the
center line only deriving from the skew of the involved distribution laws.


• Plug-in method underestimates the confidence of the region it finds by
around two percentage points as a consequence of the single side implication
(4.95), since we have a coverage 0.9232 against the 0.9025 requested
confidence.


• The probability gap is clearly realized in the $\pi$-chart if we consider the
reverse mapping from the curves belonging to the confidence region to
their images in this chart. Here the gray points represent all the curves
in the confidence region obtained with the peeling method. Hence, points
outside the $\mathcal{B}$ domain obtained with plug-in method represent:


1. curves that belong to this region in the r-chart but are not
bookkept for computing the probability of $\mathcal{B}$. Given the inclusion relation
between the regions obtained with the two methods, these points
represent a subset of the intruder curves not taken into account by the
plug-in method. Namely, gray points trespassing the plug-in contour
in the $\beta_0$ direction represent straight lines lying within the
confidence region in the $(\ln t, \ln h(t))$ plane, where an adequate shift of
$\beta_1$ compensates a variation in the $\beta_0$ parameter out of the interval
$[(\beta_0|\beta_1)_{\delta_0/2}, (\beta_0|\beta_1)_{1-\delta_0/2}]$;


2. curves that only partially lie in the r-chart plug-in region. They
trespass this region at values of t either $\approx 0$ or very high that have
not been considered in the peeling of the confidence region. This
process indeed hits a compromise between: i) the measure of this
swarm of curves that broaden the confidence region with $\ln t$ out
of the range of expectable values, which we denote as operational
range, and ii) the width of the neck of the region in correspondence
of the most probable values of t, that would be increased by curves
compensating the removal of the above swarm.


• We may relieve the above differences by a simple smoothing of the contours
obtained with plug-in method with the side benefit of downsizing the
region extension. The smoothing may be done either empirically (fast way)
or through suitable depth functions (with some additional computational
load).


Example 4.6. Here we are interested in the illness reoccurrence time of a patient
after surgery [Marubini and Valsecchi, 1995, Boracchi and Biganzoli, 2001]. We
have a set of 21 uncensored leukemia reoccurrence times [Freireich et al., 1963]
and look for the hazard function of their explaining function. In Fig. 4.16 we
report the contour of the confidence region obtained for the hazard function
on the basis of $s_0$, $s_1$ statistics. We also draw the MLE of the hazard function
as deputed pivot of the region. Contrasting this region with the companion
one in Fig. 4.15, we notice that the former is wider. This is due not only to the
smaller sample size but also to the degree of adequacy of the adopted models
to describe the data. Indeed, now the data are not sampled from the model
itself, like in the previous experiment. The sole real connection between the




Fig. 4.16: Hazard function confidence regions from real data. Black curves: contour of 
(asymptotically) exact 0.9025 confidence region by peeling method; gray curve: MLE. 


models underlying data in the two experiments is represented by the values of 
the mean and standard deviation that we fix in the reconstruction experiment 
exactly equal to their sample values in the leukemia data. Actually, graphs in 
Figs. 4.13 and 4.14 refer to these experimental data as well. 


4.3.5 A non linear mixed effects regression model 


The following model is employed in the literature for studying the uptake of
β-methyl-glucoside in the guinea pig intestine. It is assumed that the uptake is
due to an active membrane transport which is loosely governed by an enzyme
like reaction of the form:

$$y_{ij} = \frac{b_1 x_{ij}}{b_{2_i} + x_{ij}} + b_{3_i} x_{ij} + \epsilon_{ij} \qquad (4.101)$$
where x is the concentration of β-methyl-glucoside, and y the uptake. The
constant $b_1$ represents the maximal rate of uptake, $b_2$ an affinity constant and
$b_3$ a diffusion constant. We distinguish between parameters affecting clusters
of data (namely $b_{2_i}$ and $b_{3_i}$, which depend on the i-th pig, individual henceforth)
and overall parameters ($b_1$ and the variance $\sigma_\epsilon^2$ of $\epsilon_{ij}$, which is assumed to be
Gaussian with 0 mean). Hence we read $y_{ij}$ as the j-th observation on the i-th
individual denoting a value $x_{ij}$ of the independent variable and affected by a
noise term $\epsilon_{ij}$. This kind of model is analytically identified in the literature
[Lindstrom and Bates, 1990] through maximum likelihood point estimators of
the parameters under the additional hypothesis that individual depending
parameters distribute according to a bivariate Gaussian distribution with mean
$\mu = (b_2, b_3)$ and a covariance matrix to be estimated as well. Alternatively
we may get the parameter estimates through a common two-step
Expectation-Maximization procedure [Bishop, 1995] aimed at minimizing the parameters'
MSE [Patron-Bizet et al., 1998]. The nonlinearity of the relation among the
parameters prevents easy evaluations of the statistical spread of these estimates,
hence of any confidence interval, with these approaches. On the contrary,
confidence intervals are the favorite task of our framework. We implemented the
above mentioned procedure through the sequence (CD, NR, PV) as follows:


CD 


NR 


Forgetting the additional hypothesis, model (4.101) expresses a regression
model that is linear in the noise and nonlinear in the parameters. The
different locality degrees of parameters with individuals let us decouple
the uncertainty in the model parameters from the uncertainty in the noise
variance $\sigma_\epsilon^2$, as follows. We identify $\sigma_\epsilon^2$ as the parameter allowing the $b_{1_i}$,
explaining data gathered per individual, to coincide. In greater detail we
singularly adopt the following moments' equations on each individual:


bz; Sy, + Oxy: = (bi, + bo, bs, ) Sx, b3, Sx? b2;0 p Ses + OBS ze 
bo; Sui + Sy, = m(bı, + b2,b3,) + b3, Sx, + b20 pS cx + op Sg4.102) 
ba; Saiyi + Szy; = (bi; + bo, 63; ) S22 b3; S23 b2, 0B Sze + OES y2¢* 


which we solve obtaining {bj,,b2,,3,} as a function of og on each seed 
(Ef1,--+;&5m,), Where n is the number of observed individuals, m is the 
number of observations for each individual, £}; are specifications of a Nor- 
mal variable = and Sp,y;,5, etc. are the observed statistics defined as 


the sum over all the observations referring to the same individual of the 
quantity in subscript, i.e. Se; = ) j- Vig Vij, Su = ye ae etc. Then 
we start a loop to converge to a unique bı and Op. Namely, on each seed 
we compute Og as the value of og minimizing the variance between the bi, 
arising from each individual. Then we definitely assign to bı the average 
of obtained bı, and solve again (4.102) in bz,, b3; and og,. Finally we also 
definitely assign to Op the average of found og, and solve first and second 
equality in (4.102) to obtain definite values of bz, and bg, as well. 


In this way we obtain the bootstrap populations for first four individuals 
in Fig. 4.17. 


Without expecting to solve this case analytically, we peeled the popula- 
tion curves generated in CD block (gray lines in Fig. 4.17) to obtain the 
0.900 confidence regions drawn in the figure (light gray curves) with a 
similar procedure as in Example 4.5. However, the peeling may depend 
significantly on the x operational field. In other words, the curves that we 
checked to lie in the envelope of the others for x ranging in a given interval 
(say (0,100)) may trespass the envelope for x far from these values (for 
instance for x = 150). Moreover, also the peeling sequence may change 
the shape of the confidence region. For this case study we adopted the
interval (0,50) as an x operational field, and decided to proceed uniformly
in parallel by identifying points of curves outside the envelope of the
others, then removing in parallel the curves which these points belong to, and
repeating this loop until $\gamma/2$ curves are removed from the upper side and $\gamma/2$
from the lower one.




Fig. 4.17: 0.900 confidence regions for 4 guinea pigs. p axis: β-methyl-glucoside
concentration; v axis: β-methyl-glucoside uptake; points: observed sample. Gray lines:
pruned bootstrap curves' population. Light gray lines: confidence regions emerging
by peeling the populations. Black curves: median curves with common $b_1$.


PV In order to find the pivot of the distribution, MLE may be a solution, but 
it requires the mentioned additional hypothesis on the parameter shifts 
among individuals and in any case is computationally hard. Here we chose 
to work with median curves, as we expect the median closer to the mode 
than the mean in a unimodal distribution law. Namely, we computed the 
median ordinates in each graph rastered as mentioned before and fitted 
these points with a curve of our family. 


In conclusion, in this case we faced two kinds of noise, an observation specific
noise $\epsilon_{ij}$ that refers to each uptake observation, and a class noise that has a
common value for each individual. The latter does not appear explicitly, as it
is concealed in the pig differentiation with respect to $b_2$ and $b_3$ parameters. Its
effect shows up in the fact that the moment equations turn out to be non linear in some of the
parameters to be inferred. This is not a drawback for the Algorithmic inference
approach, apart from a possible heavier computational load. As for the rest, we


Fig. 4.18: Companion confidence regions spanned by Weibull c.d.f.s with hazard func- 
tions ranging in the analogous peeling method regions in Fig. 4.15(a). 


may draw a confidence region as in the linear models and the pivots we found 
are comparable, though different, to those found with MLE (see [Lindstrom and 
Bates, 1990] for instance) and present a clearer interpretation. 


4.4 Adequacy test of a model 


In the last section we dealt with two models for explaining a sample of data,
estimating their free parameters or some function of them. An obvious question
concerns which one of the two is more suitable for explaining the data, once
the free parameters are estimated. This is the typical question faced in
conventional statistics by the tests of hypothesis. The key difference however is
that in the mentioned framework a question concerns mainly the free parameters
of a given model and the answer relies on the same statistics we could use to
estimate these parameters. For instance a question may be "is $\lambda = 0.5$ more
adequate for explaining a given sample of a negative exponential variable than
$\lambda = 0.8$?". In our approach we have a level jump, since the adequacy question
concerns the model whose free parameters we have estimated. For instance a
question may be "is model D1 better suited than D2 to explain a given sample
of a non homogeneous negative exponential variable?".

To check the adequacy of the two candidate models, we use results from
order statistics theory [Tukey, 1947]. From a sample $\{t_1, \ldots, t_m\}$, we obtain
$(t_{(1)}, \ldots, t_{(m)})$ by sorting the values in ascending order getting intervals

$$(-\infty, t_{(1)}], (t_{(1)}, t_{(2)}], \ldots, (t_{(m-1)}, t_{(m)}], (t_{(m)}, +\infty) \qquad (4.103)$$

These intervals are called equivalent blocks since the measure $\Delta F_T(i) =
F_T(t_{(i+1)}) - F_T(t_{(i)})$ for each i (with obvious extensions $t_{(0)} = -\infty$ and
$t_{(m+1)} = +\infty$) follows a Beta distribution law of parameters 1 and m [Tukey,
1947]. On the other hand, the confidence intervals for the parameters of the


A non parametric 
symmetric test of 
hypotheses 


Which is the 
better, D1 or 
D2? 


Sorting the data 
we obtain order 
statistics 


hence equivalent 
blocks according 


to a Beta 
distribution 




[Figure 4.19: panels (a) and (b), horizontal axis b]


Fig. 4.19: Beta distribution function (continuous line) and boundary empirical
distributions (points) for the sample blocks’ measure, based on the samples of the
reconstruction problem as in Fig. 4.15.


two models give rise to confidence intervals for the corresponding cumulative
distribution $F_T$’s, as in Fig. 4.18. Focusing on a single model, we get a range
$(\Delta F_T(i))_{\max} - (\Delta F_T(i))_{\min}$ for the measure $B$ of each block when $F_T$ spans the
confidence interval. Thus we obtain two samples of $B$, one from the lower, the
other from the higher extreme of the above ranges when $i$ varies from 1 to $m+1$.
From these samples we compute two boundary empirical cumulative distributions.
Reversing the twisting argument, we expect the above Beta distribution
to lie within them if the candidate model is adequate for describing the sample data
and the family of its suffixes. We assume that the probability on the suffixes
with which the Beta distribution must be included in the above confidence region
equals the confidence of the region from which the above parameters for
$F_T$ have been drawn.
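
As a minimal numerical sketch of the equivalent-block argument, the Python fragment below computes the block measures of a sorted sample under a candidate c.d.f. and compares their empirical distribution with the Beta law of parameters 1 and m. It is a simplification made for illustration only: instead of the two boundary distributions derived from a confidence region, it evaluates one candidate c.d.f. at a time. The negative exponential model, the rate values and the helper names (`block_measures`, `beta_1_m_cdf`, `max_gap_from_beta`) are assumptions that do not come from the text.

```python
import numpy as np


def block_measures(sample, cdf):
    """Measures F(t_(i+1)) - F(t_(i)) of the m+1 equivalent blocks."""
    t = np.sort(np.asarray(sample, dtype=float))
    # Extend with F(-inf) = 0 and F(+inf) = 1.
    f = np.concatenate(([0.0], cdf(t), [1.0]))
    return np.diff(f)


def beta_1_m_cdf(b, m):
    """C.d.f. of a Beta(1, m) variable: 1 - (1 - b)^m on [0, 1]."""
    return 1.0 - (1.0 - np.clip(b, 0.0, 1.0)) ** m


def max_gap_from_beta(measures, m):
    """Largest distance between the empirical c.d.f. of the block
    measures and the Beta(1, m) c.d.f."""
    b = np.sort(measures)
    n = b.size
    theo = beta_1_m_cdf(b, m)
    upper = np.arange(1, n + 1) / n - theo
    lower = theo - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))


# Hypothetical data: a negative exponential sample with rate lam.
rng = np.random.default_rng(0)
m = 200
lam = 0.5
sample = rng.exponential(scale=1.0 / lam, size=m)


def true_cdf(t):
    return 1.0 - np.exp(-lam * t)


def wrong_cdf(t):
    # Deliberately inadequate candidate: rate five times too large.
    return 1.0 - np.exp(-5.0 * lam * t)


gap_true = max_gap_from_beta(block_measures(sample, true_cdf), m)
gap_wrong = max_gap_from_beta(block_measures(sample, wrong_cdf), m)
print(gap_true, gap_wrong)
```

With the generating c.d.f. the gap is expected to be small, while a badly chosen rate typically pushes the empirical distribution of the block measures away from the Beta law.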

We checked this method on the previous reconstruction problem in a somewhat
indirect way. Namely, we considered the confidence regions obtained
both with the well behaving statistics and with the less appropriate ones, respectively
in Fig. 4.14(a) and (b). The corresponding graphs in Fig. 4.19 show that,
although the Beta distribution is inside the boundaries corresponding to both
distribution laws, the gap between the boundaries is definitely much wider in
the second case. Thus, even without any awareness of the quality of the statistics, we
are drawn to prefer the first distribution law.


4.5 Point estimators


As mentioned in Sec. 2.3, to take an operational decision we may need exactly
one value to attribute to random parameters. You certainly cannot accept as the
output of a scale “a weight between 55 and 59 grams” for the bag of coffee a




customer wants to buy. You and your customer know that the measurement accuracy
may be limited, but you both need a number. Thus, as a result of
calibrating the scale, we must fix a precise linear relation between the rotation angle of the
dial hand and the weight we declare to the customer. In principle we may aim
at minimizing an estimator risk $\mathcal{R}$ defined as in Sec. 2.3, but the definition of
a suitable loss function may prove difficult, as it is connected to some metric
on the curves, thus requiring us to identify a density function on this metric in
turn. In general operational contexts the pivot of 2 plays the role of minimizer 
of a general-purpose risk whose definition is somewhat vague, as it depends on
mere computational choices about how to peel a population or how to order
the curves, or finally on whether it is more convenient, from a strictly computational
perspective, to compute a mode, a median, and so on. Here we recall the families
of pivots used in the past examples.


• Mode/maximum likelihood estimator. It is always possible to compute the
parameters’ MLE, to be plugged into the curve equation, when we have an
analytical form of their distribution law. Otherwise we must search for the
maximum numerically over the bootstrap population of the parameters,
and this may prove computationally heavy.


• Median curve. If we are forced to compute numerically, identifying a
median curve may be a feasible solution. We may plug into the curve
equation the medians of the parameters or, in a more sophisticated way, decide how
to manage the curve crossings or directly the candidate median curve.


• Collection of points with local properties. Facing a confidence region for
the regression curve, on each abscissa x we compute a relevant y. It may
be the mean or median value, or simply the mean ordinate of the intercepts
of the boundaries of the confidence region with the vertical line through x.
All these choices look like consistent estimators, as they fall inside the above
intercepts, and the latter tend to coincide as the sample
size grows.
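
The last two families can be illustrated with a short Python sketch. It assumes that a population of candidate lines is already available; here that population is simply simulated, and a pointwise quantile band stands in for the confidence region, so the numbers, the 0.90 level and all variable names are hypothetical choices made for the example rather than the procedures of the previous sections.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of candidate lines y = a + b * x, standing in
# for the parameter replicas an inference procedure would supply.
a_pop = rng.normal(loc=2.0, scale=0.3, size=2000)
b_pop = rng.normal(loc=0.5, scale=0.05, size=2000)

x_grid = np.linspace(0.0, 10.0, 21)
# One row per candidate line, one column per abscissa.
curves = a_pop[:, None] + b_pop[:, None] * x_grid[None, :]

# Median curve: plug the parameter medians into the curve equation.
median_curve = np.median(a_pop) + np.median(b_pop) * x_grid

# Local property, first variant: median ordinate on each abscissa.
pointwise_median = np.median(curves, axis=0)

# Local property, second variant: mean ordinate of the two boundaries of
# a pointwise 0.90 band (5% and 95% quantiles of the ordinates).
lower, upper = np.quantile(curves, [0.05, 0.95], axis=0)
mid_ordinate = 0.5 * (lower + upper)

print(np.max(np.abs(median_curve - pointwise_median)))
print(np.max(np.abs(mid_ordinate - pointwise_median)))
```

On a roughly symmetric population such as this one the three curves nearly overlap, and all of them lie between the band boundaries.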


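Example 4.7. All the above strategies give rise to the same estimators when
we look for a straight line interpolation. Namely


\[
\hat{a} = \frac{1}{m} \sum_{i=1}^{m} y_i \tag{4.104}
\]

\[
\hat{b} = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} (x_i - \bar{x})^2} \tag{4.105}
\]


which, in synthesis, represent the solution of the two normal equations


\[
\sum_{i=1}^{m} \left( y_i - a - b (x_i - \bar{x}) \right) = 0 \tag{4.106}
\]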


A number in any 
case 


Mode — 
maximum 
likelihood 


Median curve 


Local properties 


Definitely no 
misunderstanding 


Inferring a linear 
relation within a 
linear model 


The master 
equations 




\[
\sum_{i=1}^{m} \left( y_i - a - b (x_i - \bar{x}) \right) (x_i - \bar{x}) = 0 \tag{4.107}
\]


underlying the methods deriving from the mentioned criteria, as a sample translation
of the properties $\mathrm{E}[E] = 0$ and $\mathrm{Cov}[E, X] = 0$ of the noise term $E$.
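
A quick numerical check of Example 4.7 can be sketched in Python as follows. The synthetic data and the function name `fit_line_centered` are assumptions introduced for the illustration; the formulas are the centered estimators (4.104) and (4.105), and the two printed quantities are the left-hand sides of the normal equations (4.106) and (4.107) evaluated at the estimates.

```python
import numpy as np


def fit_line_centered(x, y):
    """Estimates of a and b for the line y = a + b * (x - x_bar)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar = x.mean()
    a_hat = y.mean()                                                # (4.104)
    b_hat = np.sum(y * (x - x_bar)) / np.sum((x - x_bar) ** 2)      # (4.105)
    return a_hat, b_hat, x_bar


# Hypothetical sample drawn around a straight line.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.5 + 0.8 * x + rng.normal(0.0, 0.5, size=50)

a_hat, b_hat, x_bar = fit_line_centered(x, y)
residuals = y - a_hat - b_hat * (x - x_bar)

# Sample counterparts of a zero-mean noise uncorrelated with X.
eq_4_106 = float(np.sum(residuals))
eq_4_107 = float(np.sum(residuals * (x - x_bar)))
print(a_hat, b_hat, eq_4_106, eq_4_107)
```

Both sums vanish up to rounding error, which is exactly what the normal equations require.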


Example 4.8. Consider again the data described in Example 4.6. Computing the
maximum likelihood estimates of parameters $a$ and $b$ is relatively simple. They
come from vanishing the derivatives of the sample likelihood, yielding respectively
the following relations:


\[
a = \frac{m}{\sum_{i=1}^{m} t_i^{b}} \tag{4.108}
\]

\[
\frac{m}{b} = -\sum_{i=1}^{m} \ln t_i + a \sum_{i=1}^{m} t_i^{b} \ln t_i \tag{4.109}
\]


Equation (4.109) calls for a numerical solution. Using these estimates we get
the hazard functions reported in Fig. 4.20(a).
Searching for unbiased estimators of the above parameters, as we may rely on the
numerical version of their distribution law, we approximate the mean with
the median, under a loose symmetry assumption on the involved variables, and
estimate it with the empirical 0.5 quantile. The hazard function obtained with
the estimated parameters is plotted in Fig. 4.20(b). In Fig. 4.20(c) we also
report for comparison’s sake the hazard function obtained as the half ordinate
of the confidence intervals. We may realize that generally the three estimators
loosely coincide.
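
The numerical solution called for by (4.109) may be sketched as below. The sketch assumes a Weibull model with c.d.f. $F(t) = 1 - e^{-a t^b}$, which is an assumption made here because Example 4.6 is not restated in this section; under it, substituting (4.108) into (4.109) leaves a single equation in $b$ whose left-hand side decreases with $b$, so a plain bisection is enough. The function name `weibull_mle`, the simulated sample and the tolerances are hypothetical.

```python
import numpy as np


def weibull_mle(t, tol=1e-10, max_iter=200):
    """MLE of (a, b) for the assumed c.d.f. F(t) = 1 - exp(-a * t**b)."""
    t = np.asarray(t, dtype=float)
    m = t.size
    log_t = np.log(t)

    def g(b):
        # (4.109) with a eliminated through (4.108):
        # m/b + sum(ln t) - m * sum(t^b ln t) / sum(t^b).
        z = b * log_t
        w = np.exp(z - z.max())          # rescaled t**b, avoids overflow
        return m / b + log_t.sum() - m * np.sum(w * log_t) / np.sum(w)

    lo, hi = 1e-6, 1.0
    while g(hi) > 0 and hi < 1e6:        # g decreases in b: widen the bracket
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    b_hat = 0.5 * (lo + hi)
    a_hat = m / np.sum(t ** b_hat)       # (4.108)
    return a_hat, b_hat


# Hypothetical sample: T = (E / a)^(1/b) with E a unit-rate exponential.
rng = np.random.default_rng(5)
a_true, b_true = 2.0, 1.5
t_sample = (rng.exponential(1.0, size=5000) / a_true) ** (1.0 / b_true)

a_hat, b_hat = weibull_mle(t_sample)
print(a_hat, b_hat)
```

The same routine could be applied to each bootstrap replica of the sample, after which the empirical 0.5 quantiles of the resulting estimates would give the median-based alternative mentioned in the example.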


Fig. 4.20: Hazard functions obtained via: (a) the maximum likelihood method, (b) 
the medians’ interpolation, and (c) the half ordinate of the confidence interval. 




4.6 Estimation efficiency 


Learning is a sophisticated task where not only the involved statistics per se but
also how you compute them makes the difference. In the case of an n-dimensional
input space $\mathfrak{X}$ the regression line becomes a hyperplane and the coefficient $b$
becomes a vector of coefficients $\boldsymbol{b}$. Their MLEs move from (4.104) and
(4.105) to


\[
\hat{a} = \bar{y} \tag{4.110}
\]


b = (xxl)-1(%) (4.111) 


where x is a vector in ¥, x” its transpose, and d is a vector having as components 
the sample means of the d components. Thus this estimate involves the inversion 
of the $(x^T x)$ matrix, a typically heavy computational task having complexity
$O(n^3)$ [Knuth, 1997], where $n$ is the matrix dimension. Hence we may set up an
incremental way of estimating $\boldsymbol{b}$ by following the descent of the MSE in the
parameter space. That is, we will move along each $b_j$ direction by a quantity
proportional to the opposite of the MSE derivative with respect to $b_j$. This is a
common minimization strategy called gradient (steepest) descent, whose benefits
and drawbacks we will discuss at length in Sec. 5.4.2.2. In particular, let us
focus for simplicity’s sake on a $\boldsymbol{b}$ made up of a single component, i.e. returning
to $b$ of the line regression, and assume $a$ to be known. Let us assume we have
an estimate $b_t$ of $b$ at time $t$, and so obtain for each element of the sample
the value $\hat{y}_i = a + b_t x_i$. Instead, the correct explanation is $y_i = a + b x_i + \varepsilon_i$.
Denoting by $E_t$ the current MSE value, if we introduce the quantity $\Delta b_t = b_t - b$,
the error at step $t$


\[
E_t = \frac{1}{2} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 = \frac{1}{2} \sum_{i=1}^{m} \left( (b - b_t) x_i + \varepsilon_i \right)^2 \tag{4.112}
\]


becomes:


\[
E_t = \frac{1}{2} \sum_{i=1}^{m} (\varepsilon_i - \Delta b_t x_i)^2 \tag{4.113}
\]


Applying the gradient descent rule we obtain the estimate of $b$ at time
$t+1$:


\[
b_{t+1} = b_t - \eta \frac{\partial}{\partial b_t} E_t = b_t + \eta \sum_{i=1}^{m} (\varepsilon_i - \Delta b_t x_i) x_i \tag{4.114}
\]

which reads:

\[
\Delta b_{t+1} = \Delta b_t \left( 1 - \eta \sum_{i=1}^{m} x_i^2 \right) + \eta \sum_{i=1}^{m} \varepsilon_i x_i \tag{4.115}
\]


after having subtracted $b$ from both terms of the equality, i.e. $\Delta b_t = b_t - b$. A
decrease in the error term generally corresponds to a decrease in the shift between $b_t$
and $b$. Indeed, the error variation between two successive times, $\Delta E_t = E_{t+1} - E_t$,
reads after (4.115):


A good estimator 
is a quick 
estimator 




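\[
\Delta E_t = \frac{1}{2} \sum_{i=1}^{m} (\varepsilon_i - \Delta b_{t+1} x_i)^2 - \frac{1}{2} \sum_{i=1}^{m} (\varepsilon_i - \Delta b_t x_i)^2
= \frac{1}{2} \left( \Delta b_{t+1}^2 - \Delta b_t^2 \right) \sum_{i=1}^{m} x_i^2 - \left( \Delta b_{t+1} - \Delta b_t \right) \sum_{i=1}^{m} \varepsilon_i x_i \tag{4.116}
\]


i.e.


\[
\Delta b_{t+1}^2 - \Delta b_t^2
= \frac{2 \Delta E_t}{\sum_{i=1}^{m} x_i^2} + \frac{2 \left( \Delta b_{t+1} - \Delta b_t \right) \sum_{i=1}^{m} \varepsilon_i x_i}{\sum_{i=1}^{m} x_i^2}
= \frac{2 \Delta E_t}{\sum_{i=1}^{m} x_i^2} + \frac{2 \left( -\eta \Delta b_t \sum_{i=1}^{m} x_i^2 + \eta \sum_{i=1}^{m} \varepsilon_i x_i \right) \sum_{i=1}^{m} \varepsilon_i x_i}{\sum_{i=1}^{m} x_i^2} \tag{4.117}
\]
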
Disregarding the second addend, which is unpredictable since it depends on the
noise, we may use $\Delta E_t$ as an indicator of the learning efficiency. Specifically,
we divide the original sample into $k$ samples of size $m/k$. Then, as parameter
$b$ is the same, at each iteration we tentatively update the current $b_t$ according
to (4.114) on each subsample and select the update causing the highest error
decrease. Thus we get a notable convergence speed of $b_t$ to $b$ in the initial phase
of the learning sessions. We can appreciate it in Fig. 4.21(a), where the course
of convergence is reported using the whole sample and the subsample strategy
respectively (to maintain the order of magnitude of the $b$ increments, we used
learning rates $\eta$ and $\eta' = \eta \cdot k$, respectively). At a certain point, however, we
suffer from the relative scarcity of the data. This happens when further linear
components in the residuals $y_i - \hat{y}_i$ w.r.t. $x_i$, where $\hat{y}_i$ is the ordinate computed
by the regression line, are concealed by the data randomness. We check this
with a test of hypothesis strategy similar to what we used to check the model
adequacy in Sec. 4.4. In other words, now we fix a threshold for the statistic $p$:


Daly- m—2 
p= $m T k (4.118) 
i=1\32 i 


where $k$ is the $x$ dimension, and which we know to follow a Fisher distribution law
[Wilks, 1962] of parameter m/k and 2 in the absence of a linear relation between
$\hat{y}_i$ and $y_i$. We decide to switch from the subsample strategy to the use of the
whole sample when $p$ starts decreasing (see Fig. 4.21(b)).
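
The whole-sample and subsample strategies can be compared with the following Python sketch of update (4.114) for a single coefficient $b$ with $a$ known. Several choices are assumptions made only for the illustration: the synthetic data, the learning rate, the function names, and in particular the reading of "highest error decrease" as the lowest error measured on the whole sample. The switching rule based on the statistic $p$ is not implemented.

```python
import numpy as np


def error(b_t, a, x, y):
    """E_t = 1/2 * sum_i (y_i - a - b_t * x_i)^2, as in (4.112)."""
    return 0.5 * float(np.sum((y - a - b_t * x) ** 2))


def gd_step(b_t, a, x, y, eta):
    """Update (4.114): y_i - a - b_t * x_i equals eps_i - Delta b_t * x_i."""
    return b_t + eta * float(np.sum((y - a - b_t * x) * x))


def whole_sample_descent(b0, a, x, y, eta, steps):
    b_t, history = b0, []
    for _ in range(steps):
        b_t = gd_step(b_t, a, x, y, eta)
        history.append(error(b_t, a, x, y))
    return b_t, history


def subsample_descent(b0, a, x, y, eta, steps, k):
    """At each step try one update per subsample (rate eta * k) and keep
    the candidate with the lowest whole-sample error (an assumption)."""
    blocks = np.array_split(np.arange(x.size), k)
    b_t, history = b0, []
    for _ in range(steps):
        candidates = [gd_step(b_t, a, x[idx], y[idx], eta * k) for idx in blocks]
        b_t = min(candidates, key=lambda c: error(c, a, x, y))
        history.append(error(b_t, a, x, y))
    return b_t, history


# Hypothetical data: m = 200 points, k = 4 subsamples of size 50.
rng = np.random.default_rng(3)
m, k = 200, 4
a_true, b_true = 1.0, 2.0
x = rng.uniform(0.0, 1.0, size=m)
y = a_true + b_true * x + rng.normal(0.0, 0.3, size=m)

eta = 0.1 / np.sum(x ** 2)               # keeps eta * sum(x^2) well below 1
b_whole, err_whole = whole_sample_descent(0.0, a_true, x, y, eta, steps=100)
b_sub, err_sub = subsample_descent(0.0, a_true, x, y, eta, steps=100, k=k)
print(b_whole, b_sub, err_whole[:3], err_sub[:3])
```

With the rates $\eta$ and $\eta \cdot k$ the whole-sample update is the average of the $k$ candidate updates, so far from the optimum the selected candidate moves at least as far as the average does. Close to the optimum the candidates scatter around it, which is where a switch to the whole sample becomes attractive.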


4.7 Bibliographical notes and further readings 


Shifting from Boolean to real labels for examples, i.e. to the problem of learning
a real-valued function, represents an unsuccessful step in conventional computational
learning. Apart from some molded attempts at translating this problem
into an equivalent one on Boolean concepts, the issue is almost entirely neglected.
One way of realizing this translation consists in defining a class of concepts having
the shape of the confidence regions around candidate curves and learning this



[Figure 4.21: panels (a) and (b), vertical axes $E_t$ and $p$, horizontal axis $t$]


Fig. 4.21: (a) Behavior of the current MSE $E_t$ with respect to the number $t$ of iterations
using the full sample of m = 200 data (black line) and the subsample strategy with size
m/k = 50 (gray line); (b) curve of the test statistic $p$ versus the iteration number until the
switching decision.


class with a prefixed percentage of errors mimicking the confidence level of the
goal region [Anthony and Bartlett, 1999]. More complex and less disingenuous
efforts have been made by authors like Haussler [Haussler, 1989], invoking
Lipschitzianity properties (see footnote 15 in Chapter 2) of the goal function in order
to build suitable concepts around the map computed by the real-valued function.
In general they thicken the graph of the function through spheres centered on it,
with typical arguments used in dense spaces of functions [Rudin, 1974], as in the case
of fractals [Falconer, 1960]. As a matter of fact, researchers abandon the
randomness of concepts when they are non Boolean, and transfer it to error terms
that affect the observations of the function and consequently the estimates of its
parameters. This is the widespread regression theory framework set up by
statisticians before the advent of computational learning, as discussed in depth
in many books such as [Morrison, 1967, Sen and Srivastava, 1990]. A modern
evolution of this theory consists of the many interpolation methods based
on splines [de Boor, 1978], wavelets [Mallat, 1998] or dynamical systems
identification [Goodwin and Paine, 1977]. Artificial Intelligence researchers generally
preferred to overcome the limitations of this approach, in spite of the elegance of
the developed theory and the huge number of suitable applications, by directly
addressing themselves to the subsymbolic learning paradigm that we will discuss
in the next chapter, possibly in conjunction with fuzzy sets [Nauck et al., 1997]
or other kinds of rough sets [Selman and Kautz, 1996] relaxations of the goal
function.

The theory in this chapter is essentially new, as it manages functions to be
inferred as random items of a set of functions. This represents a natural extension
of the approach in Chapter 3. The need for a more ductile regression theory
has been claimed on various occasions in the past, especially for biometric data,




where the simplifying hypothesis of a Gaussian noise drifting the observations
from simple, possibly linear functions, very often does not apply [Boracchi and
Biganzoli, 2001, Meeker and Escobar, 1995, Apolloni et al., 2006a]. The primary
results concern confidence regions. Rigorously moving from confidence regions
to single curve estimators would be rooted in a functional analysis for optimizing
functionals of the goal function. Rather, we look for more immediate tools, with the
benefit of getting relatively quick solutions with the conservative property of
being consistent.

The non parametric framework we used to check the adequacy of a learnt
function is a classic one (for further reference, the literature from the early to mid
20th century is rich in discussions of the matter [Fraser, 1957, Scheffe and Tukey,
1945, Tukey, 1947]). How we implement the check is rather original. In the typical
test for checking the adequacy of a given distribution law to describe a
sample, namely the Kolmogorov-Smirnov one [Fraser, 1957], we know that
for any given distribution law the absolute value of the maximum drift between
the cumulative distribution law and its approximation through the empirical
cumulative distribution law (see Appendix B) follows a distribution law computed
by the above mentioned authors. Hence, for a given sample we assume it
to follow a distribution law if the above statistic falls within adequate quantiles
of the Kolmogorov-Smirnov distribution law. While similar, our rationale is
implemented in a different way. In the center we put the cumulative distribution
representing the mean of the block distributions compatible with our sample,
and on the sides the analogous distributions computed on the two quantiles of
the goal function, hence representing the quantiles of the block distribution.

The preliminary questions we pose about the efficiency of implemented inference
methods will be considered in greater depth in the next chapter.


5 Subsymbolic learning


If we do not know the concept class C we cannot invent it. Rather, it is preferable 
to rely on an initial class of random functions, to be refined during the learning 
process. The main advantage of this strategy is that we avoid prejudices that could
render futile any attempt to further improve our knowledge (in other words, 
these prejudices could prevent us from working with consistent estimators, see 
Definition 2.14). We already have seen in Chapter 1 how to build by ourselves a 
uniform random number generator. Here we build up a random element A from a 
family of functions each computing any output for a given input. Thus, initially 
there is no bias about the value the built function will compute on a certain 
input; at the same time, it must be open to adaptation so as to reproduce any 
special function suggested to us by a set of examples. We identify this family 
by a function generator endowed with many nonlinear local functions collapsed 
via a rich set w of free parameters. The latter are so numerous and the former 
are so rich that, for a proper choice of the parameters, any computable function 
may be finely approximated by an element of the family. Thus learning coincides 
with the identification of w as the pointer spanning the wide class of concepts 
constituted by the above family. Initially, we pick the parameters at random, 
so that the function proves random, too. Then we modify them according to a 
very rudimentary twisting argument that reads as follows. 

For a set of instances $\mathfrak{X}$ consider a risk function $\mathcal{R}$ whose minimization gives $h$
as a solution. For instance, we want to minimize the MSE (see (2.102)) between
the goal function $c$ and the approximation $h$, which consists of the expected
square distance between $c(X)$ and $h(X)$. As a function of the future population, $\mathcal{R}$
too is a random variable. Thus we are looking for a statistic $S$ on the example
set related to $\mathcal{R}$ such that we may assume


\[
(s < s') \Leftrightarrow (r < r') \tag{5.1}
\]


where $s$ and $s'$ are any two different specifications of $S$ computed for two different
assignments $w$ and $w'$ of the free parameters, and analogously $r$ and $r'$ are
specifications of $\mathcal{R}$. Thus, beginning with $w$, we agree to switch to $w'$ when
$s' < s$, understanding from (5.1) that $\mathcal{R}$ too decreases with this move.

As a matter of fact, we just presume it is true. Generally no formal proof 
exists. After Theorem 2.1, we can really consider (5.1) as a twisting argument 




A uniform 
generator of 
random functions 


i.e. a function 
with many free 
parameters 


uniformly 
specified 


plus a learning 
mechanism for 
assessing these 
specifications. 


A risk function Z 
for addressing h, 


a statistic S for 
monitoring 
through the 
sample the Z 
descent, 


provided that Z 
twists with S 


MSE versus $s = \sum_i (y_i - h(x_i))^2$


No sufficiency is 
guaranteed 


No convergence 
is guaranteed 


A non trivial
inference problem 
solved 


through a neural 
network 
generating 
almost consistent 
hypotheses 




if
i) $\mathcal{R}$ constitutes a parameter of a regular distribution on $\mathfrak{X}$ (see Definition
2.4) and


ii) $S$ is a sufficient statistic.


None of these properties may be proved in general. In addition, implementing
population boosting methods is out of the question, because of the poorness
of the inference framework. Paradoxically, the twisting argument, which is more
demanding in terms of involved statistics, is less demanding in terms of boundary
knowledge than bootstrap methods, because a part of this knowledge is made
up for by the involved logical structure. Vice versa, we cannot bootstrap
populations if we do not know the shape of their distribution law.


Example 5.1. Let $\mathcal{R}$ be the MSE of $h$ w.r.t. $C$, namely $\mathrm{MSE}[h, C] =
\mathrm{E}[(C(X) - h(X))^2]$ where $C$, as the explaining function of $Y$ w.r.t. $X$ in a
suffix of a given sample $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, is the random goal function.
The expected value is taken both on the family of suffixes (and related explaining
$c$’s) and within each population constituting a suffix ¹. Thus, with respect to
the regression problem addressed in the previous chapter, on one side we abandon
the idea of computing the distribution law of $C$, on the other we consider the
randomness of $X$ in order to soften the search for a suitable function, whose suitability is
mainly requested on the most probable inputs. Let us base the rough twisting
argument on the statistic $s = \sum_{i=1}^{m} (y_i - h(x_i))^2$ taken over the sample. With
respect to the above conditions i), ii), we have that for $m$ large enough:


• We may assume that $S/m$ follows a Gaussian distribution law with mean
$\mathrm{MSE}[h, c]$. But $(C(X) - h(X))^2$ does not follow a Gaussian distribution
law in general, thus we cannot expect $S$ to be a sufficient statistic.


• The law of large numbers ensures that $S/m$ is extremely close to $\mathrm{MSE}[h, c]$
with high probability. But we must consider that sentry points and pledge
points gnaw away at the actual number of examples to be considered when
judging whether $m$ is large.


In very broad terms, we could say that if $h$ and the distribution law of $X$
were so simple as to avoid one of the two mentioned inconveniences, then it should
not be so hard to understand from past experience, with good approximation, the
symbolic features of the class to which $h$ belongs. Vice versa, we are generally
not dealing with a trivial learning problem. We can reasonably assume $s$ to be
a meaningful indicator of the behavior of $\mathrm{E}[(C(X) - h(X))^2]$, provided that
$h$ is a sufficient statistic in regard to this parameter (note: still depending on
$h$ itself). When our target is a Boolean function, learning theory requests that
$h$ be consistent with the example set. This translates into a proper selection of
¹ According to our notation, $C$ is the random function with specifications $c$ ranging in the concept class C.


Note that the expected value of $(C(X) - h(X))^2$ within a single suffix is the companion of
U.+p considered in Chapter 3. 




an item within a class of hypotheses sufficiently close to a concept class. Here
neither is the target Boolean nor do we know the function family to which
$h$ belongs. Otherwise we would have to speak of a regression problem in the
province of the past chapter. We address the actual hitting of $h$ in the parameter
space by the rough twisting argument (5.1). Then we check the value of the
solution on a new set of examples, which we call a test set. We call the $h$
generator a neural network (or NN for short). It describes a large concept class C
constituted by all the input-output mappings occurring in correspondence with
$w$ specifications. In this scenario we will resort to sentry points theory to get
some vague directions on sample complexity. But we will refer to more essential
results from information theory to make a preliminary shaping of C, for instance
by suitably bounding the number of $w$ components. A learning algorithm selects
one of these specifications $h$ that is supposed to almost minimize $s$. We check
the left implication in (5.1) by generalizing $h$, i.e. evaluating $s$ or other relevant
statistics for $\mathcal{R}$ on a new set of examples. In order to be representative of the
population to which $\mathcal{R}$ applies, they are (whenever possible) independent
of the previous examples on which $s$ has been assessed, hence unbiased by the
learning process. Moreover, still in order to soften the request to the learning
algorithm, we generally assess the generalization capability on a sample of training
and test sets, hence as a mean capability on a set of learning instances, rather
than expecting to check it on each instance.
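
The check described above, i.e. computing $s$ on the training examples and again on independent test examples, averaged over several training and test splits, can be sketched in Python as follows. To keep the fragment short the hypotheses are polynomials fitted by least squares rather than neural networks, and the goal function, noise level, split sizes and function names are all hypothetical choices made for the illustration.

```python
import numpy as np


def s_statistic(h, x, y):
    """s = sum_i (y_i - h(x_i))^2 over the given examples."""
    return float(np.sum((y - h(x)) ** 2))


def fit_polynomial(x, y, degree):
    """Stand-in learning algorithm: least-squares polynomial hypothesis."""
    coeffs = np.polyfit(x, y, degree)
    return lambda t: np.polyval(coeffs, t)


def mean_errors(x, y, degree, rng, n_splits=20, train_size=30):
    """Mean of s/size on training and test sets over random splits."""
    train, test = [], []
    for _ in range(n_splits):
        perm = rng.permutation(x.size)
        tr, te = perm[:train_size], perm[train_size:]
        h = fit_polynomial(x[tr], y[tr], degree)
        train.append(s_statistic(h, x[tr], y[tr]) / tr.size)
        test.append(s_statistic(h, x[te], y[te]) / te.size)
    return float(np.mean(train)), float(np.mean(test))


# Hypothetical goal function observed with noise.
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=300)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=300)

results = {d: mean_errors(x, y, d, rng) for d in (1, 3, 7)}
for degree, (train_err, test_err) in results.items():
    print(degree, train_err, test_err)
```

A hypothesis family that is too poor (degree 1) shows a large error on both sets, while a rich one (degree 7) shows a training error smaller than its test error, which is why $s$ on the training set alone does not settle the question.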

As a permanent adaptation of some free parameters pursuing the optimiza- 
tion of some behavioral conditions, learning is the attitude we expect in the 
human brain. This is why we call our model neural network. Actually, later we 
will see how the analogy has a greater extension, involving the computing and 
adaptation mechanisms we hypothesize in the brain and emulate in artificial 
neural networks. We start this chapter with a brief section devoted to fixing 
notations. Then we discuss some Information Theory results, mainly in the 
framework of Kolmogorov Complexity [Kolmogorov, 1965], to give direction to
the chapter’s statistical issues. Next we will consider two risk functions gener- 
ally employed for addressing the solution of our learning task, related strategies 
to get a minimum for them, and theoretical tools to appreciate and improve 
the value of the learnt hypotheses. In this way we conclude our journey from 
the most comfortable situation where we know almost everything about our 
operational framework (except for some few parameters to be estimated) to the 
very primordial one where we know nothing but a set of examples on how the 
framework interacted with us. We could consider the scenario where the first 
man, Adam, started learning what was edible in order to survive. He had only 
his brain as a tool for examining any hypothesis on the matter, plus a set of 
sensory devices giving him signals about edibility. Adam solved his problem in 
a wonderful way, and we are the test set of this. Through a similar discourse we 
neither exhaust the broad subject of neural networks nor provide a firm solution 
to the actual problem of automatically eliminating or delaying the reading of 
unsolicited messages. However we do supply a general perspective that should 
enable the reader to appreciate the suitability of most subsymbolic learning al- 
gorithms available in the literature, versus his own learning problem, and thus 


Late directions
from
computational
learning theory


Hints from 
information 
theory 


Request for a 
direct check of 
the solution 


A task for 
biological neural 
networks as well 


hence the root of 
our brain’s 
success. 


A lot of 
elementary 
computing 

elements 


highly
interconnected and


open to changing
their behavior


to comply with
external stimuli.


Either symbolic 
or subsymbolic 
processing 
elements 


A grid of PEs 




get a clear idea of the statistical rationale underlying them. 


5.1 A very essential taxonomy of neural networks 


Imagine a melting pot where a lot of people are more or less stably grouped in 
communities that exchange signals with one another. The effect of the signals on 
the incoming cluster is to change both the output and the cluster structure per 
se. You may also imagine that a similar phenomenon occurs within the community
at various levels of granularity. From the outside you can decipher the behaviors
of some fine grain groups. Thus you may say, for instance: this group imports
leather and produces gloves, that group collects books and develops new ideas.
For other groups it is harder to grasp what they are actually doing. So, you may
say a certain group is involved in politics yet still wonder what the people do.
Groups of children surely fall in this category: you say they are children, but
what they will end up doing will be seen in the future, depending somewhat
on the occasions they will meet and much more on the education they
will receive. Aiming to interact effectively with this society, you will associate
a specific function to each of the former groups; meanwhile you consider the latter
ones (the children’s) as random functions, in the randomness notion we developed
in Sec. 1.3.6 and try to modify them to your senting. In a more gentle notation 
we say you teach each of the latter communities how to behave according to a 
functionality of interest to you. And since teaching is a hard job, no matter 
what people say, you force them to learn your commodity by themselves by
obeying a behavior protocol that we simply call a learning algorithm.

Don’t worry, at present we assume that a similar social phenomenon concerns 
only silicon molecules. So, represent the melting pot as a lot of processing 
elements (PE) variously interconnected. Denote by a square those that stably 
compute a function, call them symbolic PE (usual processors, like a PC we may 
buy at a store) and by a circle the other processors. We will say that the latter 
perform a subsymbolic computation, just because we cannot describe in an 
analytic (i.e., symbolic) way the function they compute. Call them subsymbolic 
PE. The life of this tricky computing machine is the following: start with an 
initial set of messages flowing through the connections. Then on the basis 
of current incoming signals, each processor individually computes its output, 
which it sends to the connected companions. To give some order to this chaotic 
processing mode, with reference to Fig. 5.1 we fix a terminology and some 
specifications without diminishing in principle its generality. 


• Neural network: a graph consisting of $\nu$ PEs (either symbolic or subsymbolic)
connected by oriented arcs ². The graph is not necessarily fully
connected, in the sense that it may lack a connection between some pairs
of PEs. In case both kinds of processors appear in the network, we generally
speak of hybrid systems.


² Thus the arc connecting PE i to PE j is different from the one connecting PE j to PE i.




Fig. 5.1: The general architecture of a neural network. Squares: symbolic PEs; circles:
subsymbolic PEs. $\gamma_i$ denotes the threshold of PE i. Analogously, $w_{ij}$ and $w_{ji}$ denote
the weight of the connection from PE j to PE i and vice versa.


• State $\tau = (\tau_1, \ldots, \tau_\nu)$: the current ordered set of messages in output of
each processor, where each message $\tau_i$ takes values in a set $\mathfrak{T} \subseteq \mathbb{R}$. We
introduce the order to identify the messages of the single PEs, thus we
refer to a state vector. If we want to specify a time dependence we will
denote the state with $\tau(t)$, and a sub-state related to the PEs’ location $L$
(capital letter) with $\tau^L$.


• Free parameters $(w, \gamma)$: consist of the weight vector ³ $w =
(w_{11}, \ldots, w_{ij}, \ldots, w_{\nu\nu}) \in D_w$, where $w_{ij}$ is associated to the connection
from processor $j$ to processor $i$ ($w_{ij} = 0$ if no connection exists from $j$ to
$i$), and the inner parameters vector $\gamma = (\gamma_1, \ldots, \gamma_\nu)$, where $\gamma_i$ is associated
to processor $i$. Depending on its use, typically according to (5.2) later
on, when no ambiguity occurs we will refer to the sole vector $w$, adding a
dummy PE piping a signal constantly equal to 1 to the $i$-th PE through
a connection with weight set to $\gamma_i$.


• Activation function vector: the ordered set of functions $h_i : \mathfrak{T}^\nu \to \mathfrak{T}$
computing a new state $\tau_i = h_i(\tau)$ of the $i$-th PE as a function of the current
state vector (as a component of the hypothesis $h$ computed by the whole
network).


• Activation mode vector: the synchronization order of the single processors.
We may distinguish for instance between the following activation modes:


³ The connection weights are often modeled through a matrix $W$ whose generic element is
$w_{ij}$. As in the related formulas we will never use matrix forms, we describe weight connections
through the vector $w$, which can be thought of as a vector rearrangement of the rows in $W$.


An algorithm for 


molding the free 
parameters 


on the basis of 
given examples 


checked on new 
examples 


in terms of 
performance of 
the molded 
network. 


Neurons 
Connections 


Weights and 
thresholds 


A linear function 


A step function, 




1. parallel: the PE updates its state at the same time as the other PEs
synchronized with it.


2. asynchronous: the PE updates its state according to an inner clock.


3. random: the $i$-th PE tosses a die with as many faces as there are
randomly activated processors in the network. It renews its state
when the die outcome is exactly $i$.


4. delayed: the PE updates its state a given time after the updating of
the afferent PEs. In particular, instantaneous mode means a delay
equal to 0.


• Training algorithm: an algorithm modifying the free parameters according
to some utility function. In a broad sense we may consider as a free
parameter the architecture of the network, too, which may be modified by
some training algorithm.


• Training set: the set of examples in input to the training algorithm.


• Test set: a set of examples, not considered for training the network, but
used to check how the trained network behaves. Some training algorithms
use this set for tuning the parameter modifications and a third set, called
validation set, for the above check.


• Generalization: the network’s aptitude for behaving well on the test set.
In the case of subsymbolic processors the notation specifies as follows: 

• PE → neuron;

• arc → connection;

• free parameters: $w_{ij}$ → connection weight, $\gamma_i$ → threshold;

• activation function: $h_i(\boldsymbol{\tau}) = \sigma(a_i(\boldsymbol{\tau}))$, where

$$a_i(\boldsymbol{\tau}) = \sum_{j=1}^{\nu} w_{ij}\tau_j + \gamma_i \qquad (5.2)$$

is the net input to PE $i$. Hiding the neuron index, the most common
expressions of $\sigma$ are the following:
1. the simplest one is a linear function (see Fig. 5.2(a)):
$$\sigma(a) = \beta a \qquad (5.3)$$
with $\beta \in \mathbb{R}$;
2. the primary nonlinear one is the Heaviside function (see Fig. 5.2(b)):

$$\sigma(a) = \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$

which smooths in two directions described in the following two points;




Fig. 5.2: Typical shapes of the activation function $\sigma$: (a) linear as in (5.3); (b)
nonlinear, non continuous as in (5.4); (c) probabilistic, as in (5.5); and (d) nonlinear,
continuous, as in (5.6). $\beta$ constitutes a curve parameter.


3. the primary probabilistic one (see Fig. 5.2(c)) is described as follows:

$$P(\sigma(a) = 1) = \frac{1}{1 + e^{-\beta a}}, \qquad P(\sigma(a) = 0) = \frac{1}{1 + e^{\beta a}} \qquad (5.5)$$

with $\beta \in \mathbb{R}^+$, which smooths function (5.4) in terms of random
events, coinciding with the original function for $\beta = +\infty$ ⁴. Hence the
meaning of $\beta$ is the inverse of a temperature $\theta$ of a thermodynamic
process determining the value of $\tau$ [Amit et al., 1985];

[Margin note: its probabilistic smoothing,]


4. the primary continuous one (see Fig. 5.2(d)) is the so-called sigmoid
function:

$$\sigma(a) = \frac{1}{1 + e^{-\beta a}} \qquad (5.6)$$

with an analogous smoothing effect, $\sigma(a)$ in (5.6) being the expected
value of $\sigma(a)$ in (5.5).

[Margin note: its continuous smoothing.]
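The four activation functions listed above can be sketched in a few lines. This is only an illustrative sketch: the function names, the sample weights and the value of $\beta$ are our own assumptions, not part of the text. The last lines check empirically that the sigmoid (5.6) is the expected value of the probabilistic activation (5.5).

```python
import math
import random

def net_input(weights, state, threshold):
    """Net input a_i = sum_j w_ij * tau_j + gamma_i, as in (5.2)."""
    return sum(w * t for w, t in zip(weights, state)) + threshold

def linear(a, beta=1.0):
    """Linear activation (5.3)."""
    return beta * a

def heaviside(a):
    """Heaviside activation (5.4): 1 if a > 0, 0 otherwise."""
    return 1 if a > 0 else 0

def sigmoid(a, beta=1.0):
    """Sigmoid activation (5.6), arranged to avoid overflow for large |beta * a|."""
    z = beta * a
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def probabilistic(a, beta=1.0, rng=random):
    """Probabilistic activation (5.5): returns 1 with probability sigmoid(a, beta)."""
    return 1 if rng.random() < sigmoid(a, beta) else 0

# Hypothetical neuron with three afferent PEs, all in state 1.
rng = random.Random(0)
a = net_input([0.5, -1.0, 2.0], [1, 1, 1], threshold=-0.7)
draws = [probabilistic(a, beta=2.0, rng=rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
# The empirical mean of (5.5) should be close to the sigmoid value (5.6).
print(empirical_mean, sigmoid(a, beta=2.0))
```

As $\beta$ grows, `sigmoid` approaches `heaviside` for every $a \neq 0$, which is the smoothing relation stated in points 3 and 4.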


⁴ Hence $P(\sigma(a) = 1) + P(\sigma(a) = 0) = 1$.


[Margin notes: Quite different from a Turing Machine in the way of computing the
output or even in the way of feeding input and observing output.]




Remark 5.1. With reference to the description of a Turing Machine (reported in 
Appendix D) as a template of commonly used computer devices like our PC, a 
neural network is a computing device with the following distinguishing features: 


1. mass storage: there are no tapes where the records are explicitly stored, 
rather they are synthesized into the connection weights; 


2. control unit: it is ruled by a transition function $\delta : \mathfrak{S} \times \Gamma \to \mathfrak{S}$, hence a
possibly parallel computing function on the set of states modulated by a
set $\Gamma$ of parameters. $\mathfrak{S}$ may coincide with the space of either states $\boldsymbol{\tau}$, with
$\Gamma$ a set of weights, or weights $w$, with $\Gamma$ a set of states, or both. The set
of $\delta$, actually of algorithms, running on this control unit is very limited:
a few learning or optimization algorithms plus slight variants;


3. halt states: absent. Possibly the machine evolves into a stationary state,
and we say it halts. Possibly, it may keep on running;


4. data: a subset of the state vector log.


If the connections' grid does not contain a loop the graph shows an orientation
(directed acyclic graph, see Fig. 5.6). Thus we may interpret the nodes
without incoming arcs as input nodes and those without outgoing arcs as
output nodes. We fix the state $\boldsymbol{\tau}^I$ of the former, then wait for all the nodes
to have updated their states after this external solicitation (we may figure this
as a propagation of the input through the network). We consider the states of
the output nodes as the output $\boldsymbol{\tau}^O$ of the function $h$ computed by the network.
PE's in this kind of architecture are typically arranged in ordered layers: in
particular the first and last layers contain, respectively, the sets of input and
output PE's. Many models introduce the additional requirement that each PE
in a layer be connected exclusively with every PE in the subsequent one. The
resulting system is called a feed-forward network.
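The propagation of the input through a layered network can be sketched as follows. The code is a minimal illustration under our own assumptions: neurons use the Heaviside function (5.4) with net input (5.2), and the 2-2-1 network computing the exclusive OR is a hypothetical example chosen only because its output is easy to check by hand.

```python
def heaviside(a):
    """Heaviside activation (5.4)."""
    return 1 if a > 0 else 0

def layer_forward(weights, thresholds, inputs):
    """States of one layer: tau_i = sigma(sum_j w_ij * input_j + gamma_i)."""
    return [
        heaviside(sum(w * x for w, x in zip(row, inputs)) + gamma)
        for row, gamma in zip(weights, thresholds)
    ]

def feed_forward(layers, tau_in):
    """Propagate the input state layer by layer and return the output state."""
    state = list(tau_in)
    for weights, thresholds in layers:
        state = layer_forward(weights, thresholds, state)
    return state

# Hypothetical 2-2-1 network: each layer is (weight rows, thresholds).
xor_net = [
    ([[1.0, 1.0], [1.0, 1.0]], [-0.5, -1.5]),  # hidden neurons: OR and AND
    ([[1.0, -2.0]], [-0.5]),                    # output neuron: OR and not AND
]
for tau_in in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(tau_in, feed_forward(xor_net, tau_in))
```

Each layer is fully computed before the next one starts, which mirrors the idea of waiting for all nodes to update after the external solicitation.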

If some loop exists, we speak of a recurrent neural network, which generally never
stops updating its state. This is because, with any activation mode, from time
to time a node within the loop receives a new input, and computes a new output
that propagates in the network through the connection graph. It may happen
that from some time on the new input coincides with the previous one. If this occurs
for each node in the loop, then we say that the network has reached a stationary
state. We assume this state to be the output $\boldsymbol{\tau}^O$ for $\boldsymbol{\tau}^I$ coinciding with the state
$\boldsymbol{\tau}$ with which the network started its evolution. Otherwise we must explicitly
consider an evolution time along which the processors update their state on the
basis of the mentioned updating modes, picking the output as necessary from
some special PE's.

A standard way of treating the matter is to associate a time slot to each piping
of a state along a connection. Instead the processing of the new state, i.e.
the computation of g, does not consume time. A directed acyclic neural network




Fig. 5.3: Two representations of a hybrid system: (a) running network and (b) directed
acyclic mother graph. Black arrows: running connections; plain gray arrows: current
unfolded connections; dashed gray arrows: past or future unfolded connections.


reaches a stationary state in each PE after the input has been feed-forwarded to 
it. We artificially reproduce this stationarity condition by unfolding a network 
with cycles in an unlimited mother graph (see Fig. 5.3), where the transfer of 
information from one PE to another (possibly the same but at different times) 
takes one edge. As they refer to different sites in the graph, connections existing 
at a given time between PE’s might disappear at other times, depending on the 
original updating mode. The functions computed by the PE's might
change from time to time as well. Starting from the bottom of the graph at time 0, we
proceed in parallel with all connections moving from this level. Thus we identify 
a level/time 1, and so on. Of course we may get rid of this infinite depth graph 
if we are able to algorithmically describe the generic evolution step: 


$$\boldsymbol{\tau}(t+1) = g(\boldsymbol{\tau}(t)) \qquad (5.7)$$


We will consider picking outputs of this network in any relevant part of the state 
log. 
The mother graph model recovers nearly all recurrency modes. 
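The evolution step (5.7) and the role of the activation mode can be illustrated on a two-neuron loop. The sketch below is ours: it assumes Heaviside neurons, uses a fixed sweep order as a simple stand-in for the asynchronous mode, and all names and weights are hypothetical. With the same weights and starting state, the parallel mode oscillates forever while the sequential sweep reaches a stationary state.

```python
def heaviside(a):
    """Heaviside activation (5.4)."""
    return 1 if a > 0 else 0

def net_input(w, gamma, state, i):
    """Net input (5.2) to PE i for the given state vector."""
    return sum(w[i][j] * state[j] for j in range(len(state))) + gamma[i]

def parallel_step(w, gamma, state):
    """All PEs update at once from the same current state: tau(t+1) = g(tau(t))."""
    return tuple(heaviside(net_input(w, gamma, state, i)) for i in range(len(state)))

def sequential_step(w, gamma, state):
    """PEs update one after the other, each seeing the latest states."""
    state = list(state)
    for i in range(len(state)):
        state[i] = heaviside(net_input(w, gamma, state, i))
    return tuple(state)

def evolve(step, w, gamma, state, max_steps=20):
    """Iterate a step function; return the state log and whether a stationary state was met."""
    log = [tuple(state)]
    for _ in range(max_steps):
        nxt = step(w, gamma, log[-1])
        if nxt == log[-1]:
            return log, True
        log.append(nxt)
    return log, False

# Hypothetical loop of two neurons exciting each other.
w = [[0.0, 1.0], [1.0, 0.0]]
gamma = [-0.5, -0.5]
log_par, halted_par = evolve(parallel_step, w, gamma, (1, 0))
log_seq, halted_seq = evolve(sequential_step, w, gamma, (1, 0))
print(log_par[:4], halted_par)
print(log_seq, halted_seq)
```

The list `log` plays the part of the state log from which outputs may be picked.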


Example 5.2. Consider the two elementary networks in Fig. 5.4. The left net- 
work is operated in the standard way (one connection = one time slot delay) 
described before. The right one assumes that the signal flows with no time 
consumption (delay 0) along the axis from input I to output neuron O; while
covering the self connection takes one time slot. However the signal manage- 
ment in the two cases is the same. In both cases the intermediate node takes as 


[Margin notes: In any case we map on an acyclic mother graph in case of the neurons'
deterministic activation and to a Markov chain in case of random activation. The
three belts metaphor.]



Fig. 5.4: Equivalence between network dynamics. (a) standard network with delay 
1 on each edge; (b) equivalent network with non homogeneous delays. In both cases 
the intermediate node takes as input the signal produced by itself one time slot before 
receiving signal from the input node. 


input the last available signal $\tau^I$ from I (with one or no time slot delay in
the left and right network, respectively) plus the signal produced by the node
itself one time slot before receiving $\tau^I$.


In the case of random activation functions, stationarity concerns the state distribution
law. Thus we generally look for neural networks whose state is a random
vector $T \in \mathfrak{X} = \{0,1\}^{\nu}$ whose distribution law converges to a fixed distribution
representing the actual output of the network. In the case of a network made up
only of subsymbolic PE's activated according to (5.5), which is usually referred
to as a Boltzmann Machine (BM for short), we have that the state vector is
Boolean and its distribution law at time $t$ depends on the log of its evolution
only through the specification of $T$ at time $t-1$, i.e. before the neurons' update.
This figures the network evolution as a Markov chain⁵, getting stationary conditions
(through detailed balance satisfaction (C.7)) in the case of asynchronous
activation when the connections are symmetric, i.e. when $w_{ij} = w_{ji}$, $\forall i, j$, with
the diagonal terms $w_{ii} = 0$ for each $i$ and no conditions on the threshold vector.
Other stationarity conditions may also be found, for instance when the activation
mode is parallel. Moreover, the current state distribution law at time $t$ in
the case of asymmetric connection weights may be of some interest.
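The detailed balance condition can be checked numerically on a very small machine. The sketch below rests on assumptions of ours that the text does not spell out: a single PE chosen uniformly at random is updated at each step through (5.5), and the candidate stationary law is the usual Boltzmann form $\pi(\boldsymbol{\tau}) \propto e^{-\beta E(\boldsymbol{\tau})}$ with energy $E(\boldsymbol{\tau}) = -\frac{1}{2}\sum_{i,j} w_{ij}\tau_i\tau_j - \sum_i \gamma_i\tau_i$. Weights, thresholds and $\beta$ are hypothetical.

```python
import itertools
import math

def sigmoid(z):
    """Logistic function, arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def energy(w, gamma, s):
    """Assumed energy: E(s) = -1/2 sum_ij w_ij s_i s_j - sum_i gamma_i s_i."""
    n = len(s)
    pair = sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(g * x for g, x in zip(gamma, s))

def transition_prob(w, gamma, beta, s, t):
    """One-step probability s -> t when one PE, chosen uniformly, is updated via (5.5)."""
    n = len(s)
    prob = 0.0
    for i in range(n):
        if any(s[j] != t[j] for j in range(n) if j != i):
            continue  # updating PE i cannot change the other components
        a = sum(w[i][j] * s[j] for j in range(n)) + gamma[i]
        p_one = sigmoid(beta * a)
        prob += (p_one if t[i] == 1 else 1.0 - p_one) / n
    return prob

# Hypothetical symmetric weights with null diagonal, arbitrary thresholds.
w = [[0.0, 0.8, -1.2], [0.8, 0.0, 0.5], [-1.2, 0.5, 0.0]]
gamma = [0.3, -0.4, 0.1]
beta = 1.5
states = list(itertools.product([0, 1], repeat=3))
boltzmann = [math.exp(-beta * energy(w, gamma, s)) for s in states]
z = sum(boltzmann)
pi = {s: x / z for s, x in zip(states, boltzmann)}

# Detailed balance: pi(s) P(s, t) = pi(t) P(t, s) for every pair of states.
max_violation = max(
    abs(pi[s] * transition_prob(w, gamma, beta, s, t)
        - pi[t] * transition_prob(w, gamma, beta, t, s))
    for s in states for t in states
)
print(max_violation)
```

Breaking the symmetry of `w` (for instance setting `w[0][1]` different from `w[1][0]`) makes the same check fail, in line with the role the symmetry condition plays above.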

We can sketch the matter as in Fig. 5.5, where the three distinguished clus- 
ters of PE’s are so characterized: 


• thermal core: provides dynamics to the machine state;

• intermediate belt: gives structure to the above dynamics;

• visible belt: simulates the environmental distribution law.


Fig. 5.5: Thermodynamical view of a Boltzmann Machine.


⁵ See Appendix C.


Like in any sampling mechanism, the thermal core provides a seed random
variable $U$ that, for a proper $\beta > 0$, a sufficient number of neurons and non
biased free parameters, spans with almost equal probability the vertices of a
$\mu$-dimensional hypercube, where $\mu$ is the number of neurons on the borderline
with the intermediate belt. The latter represents the explaining function of the
random binary vector $T^V$ produced on the visible belt. It does not reproduce
a function $c$. Rather, we expect a training algorithm to mold the connection
structure of the intermediate belt enabling it to produce a $T^V$ population almost
indistinguishable from a suffix of a sample $\{\boldsymbol{\tau}^V_1, \ldots, \boldsymbol{\tau}^V_m\}$ supplied to the
algorithm as a training set.


5.2 Compression of a set of data 


The first conceptual task teachers ask us to perform in elementary school is to
summarize a story. They judge our work on the basis of: i) our grasp of the
story's content and thus our presentation of it (as requested, always using our own
words); and ii) our succinctness in writing the summary, while still including
the main topics of the story. When, like a child, we have no formal tool for framing
the examples, our learning algorithm adheres to the same type of rules. Data
compression is a relevant task in telecommunication, and there are a lot of methods
for giving back integrally or almost integrally at one end of the transmission
channel what was put in at the other. Most of them are based on the entropic
criterion of minimizing the mutual information at both ends of the channel. Data
compression for learning requires the same ability during the training
phase, with the further value however of processing extended code-books, thus
producing new codewords, during the generalization phase. We may grasp this
better in the Kolmogorov complexity framework. As mentioned in the introduction,
the key problem is to find sufficient statistics. In Sec. 2.3.2.3 we saw
that a powerful way to do this is by minimizing the relative entropy between
sample empirical and asymptotic distribution law, which results in searching
for maximum likelihood estimators of the distribution law's unknown parameters.
In complete unawareness of this law, compressed data play the role of
reasonable approximations of the above estimators. Indeed, consider the utmost
compression of a set of data, as defined by Kolmogorov, as the description of
the shortest program $p$ that in input to a general purpose computing machine
outputs exactly the data [Kolmogorov, 1965]. The following definition is a slight
improvement of this notion based on prefix functions.

[Margin notes: Again an inverse transform to produce random variables. Learning as a
compression task, i.e. a maximum likelihood estimate based on Kolmogorov complexity.
The most concise description of a string without ambiguity on the end, neither too long]


Definition 5.1. [Levin, 1974] Let $\mathfrak{X}$ be the set of all binary strings and $|x|$ the
length of the string $x$. Denote with $g(x) < +\infty$ the fact that a partial function
$g$ is defined on $x$. A partial recursive function⁶ (prf) $\phi : \mathfrak{X} \to \mathfrak{X}$ is called prefix
if $\phi(x) < +\infty$ and $\phi(y) < +\infty$ implies that $x$ is not a proper prefix of $y$. If we
fix a universal prefix prf $Y$ ⁷, and denote $X(p, y)$ the output of machine $Y$ fed
by program $p$ having $y$ as input, the conditional prefix (or Levin's) complexity
$K(x|y)$ of $x$ given $y$ is defined as

$$K(x|y) = \min_{p \in \mathfrak{X}} \{ |p| \text{ such that } X(p, y) = x \} \qquad (5.8)$$

and the unconditional prefix complexity $K(x)$ of $x$ as $K(x) = K(x|\epsilon)$, where $\epsilon$
is the empty string.


Remark 5.2. In short, thinking of $\phi$ as a function for coding words, to be a prefix
function means that no codeword can be a proper prefix of another codeword. As a matter of
fact, every programmer is used to dealing with prefix functions, as he delimits a
code section with particular instructions (such as Begin and End, or a balanced
pair of curly brackets).
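The prefix property of a set of codewords is easy to test mechanically. The following sketch is ours and uses hypothetical names; it also evaluates the sum $\sum 2^{-|c|}$ over the codewords, which by the Kraft inequality (a standard fact not stated in the text) cannot exceed 1 for a binary prefix code.

```python
def is_prefix_free(codewords):
    """True when no codeword is a proper prefix of another one."""
    words = sorted(set(codewords))
    # After sorting, a word that is a prefix of some other word is also
    # a prefix of its immediate successor, so adjacent pairs suffice.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2^(-length) over distinct codewords."""
    return sum(2.0 ** -len(c) for c in set(codewords))

good = ["0", "10", "110", "111"]   # a prefix code
bad = ["0", "01", "1"]             # "0" is a proper prefix of "01"
print(is_prefix_free(good), kraft_sum(good))
print(is_prefix_free(bad), kraft_sum(bad))
```

With a prefix code a decoder knows where each codeword ends without any separator, which is the "without ambiguity on the end" requirement.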


This function is a very strong tool for characterizing the inner structure of a
sequence of data between the two extremes of completely deterministic strings
of all 0's or 1's and completely random strings, for which no description more
concise than the bits' listing can be found. The main assertion in this respect is the
following fact.


Fact 5.1. The prefix complexity of a string $x$ is not greater than $|x|$ plus a
constant depending on the computing machine (actually we can always consider
the program: "output $x$").
At least half of the strings of a given length $n$ have a prefix complexity greater
than $n$, being therefore qualified random (actually we must consider that in $2^n$
strings of length $n$ we cannot fit more than $2^{n-1}$ prefix codes, where at least a 1
bit long instruction is devoted to determining whether the current bit belongs to
a command or to a text).


⁶ See Appendix D.
⁷ i.e. a machinery capable of computing any computable function according to the
Church-Turing thesis (see footnote 1 in Chapter 1).


As probability is an effective way of summarizing properties emerging from 
observed data, and prefix complexity is a strong way of identifying compu- 
tational structures among data, the following connection between both is not 
surprising (though highly quantitatively precise). 


Lemma 5.1. The probability measure $P$ of any string $x \in \mathfrak{X}$ explained by the
function $g_\theta$ is related to the prefix complexity $K$ of $x$ and $g$ (i.e. the description
of $g_\theta$ without specifying $\theta$) through the following equation:

$$P(x) \le 2^{-K(x)}\, 2^{K(g)} \qquad (5.9)$$


Proof. (Sketch) The lemma comes from the fact that we can use the prefix
binary code $\pi_{P(x)}$ of the quantity $P(x)$ as a code of $x$ itself in a prefix machinery
having the description of $g$ in its library; and this machinery can be simulated
by a universal prefix machinery $Y$ by running a code of length $K(g)$. Thus
a sequence of length $|\pi_{P(x)}| + K(g)$ can be used to code $x$ in the reference
machinery $Y$ of Definition 5.1. Of course, in respect to this machinery the
shortest code of $x$ has a length $K(x)$ no greater than the above, and $|\pi_{P(x)}|$ is
no greater than $-\log(P(x))$ ⁸. Hence, $K(x) \le -\log(P(x)) + K(g)$. □
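For the reader's convenience, the step from the last inequality of the sketch to the statement (5.9) can be written out explicitly. The chain below only rearranges quantities already introduced, with $\log$ to base 2.

```latex
\begin{align*}
K(x) &\le -\log P(x) + K(g) \\
\log P(x) &\le -K(x) + K(g) \\
P(x) &\le 2^{-K(x) + K(g)} = 2^{-K(x)}\, 2^{K(g)}
\end{align*}
```

The second line moves $\log P(x)$ and $K(x)$ across the inequality; the third exponentiates both sides, which preserves the inequality because $2^{z}$ is increasing in $z$.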


Remark 5.3. Note that the string "000...0...0" (0 repeated 10,000 times)
and a string of 10,000 bits without any apparent recurrence rule, such as
"0110100110100...", have the same probability when the bits are independent
and $P(1) = 1/2$; the former even has a greater probability if $P(1) = 0.1$. This
is true even if we may assume for the former the existence of a code shorter
than for the latter. Actually the above probabilities are negligible in any
case. Instead, with the above lemma we want to distinguish possibly high (very
close to 1) probabilities from low (very close to 0) ones. This may occur for
instance with the probability of getting a solution, given an instance of a highly
complex problem such as CONSISTENCY (see Sec. 3.2), for which we have an
approximate solving algorithm.
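The difference in description length between the two kinds of strings can be made tangible with an off-the-shelf compressor. This is only a rough proxy of ours: prefix complexity is not computable, and the length of a zlib output merely bounds from above the length of one particular description (up to the fixed cost of the decompressor). The seed and helper name are hypothetical.

```python
import random
import zlib

n = 10_000
constant = "0" * n
rng = random.Random(0)
irregular = "".join(rng.choice("01") for _ in range(n))

def packed(bits):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte."""
    return int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")

len_constant = len(zlib.compress(packed(constant), 9))
len_irregular = len(zlib.compress(packed(irregular), 9))
# Lengths in bytes: the raw listing of the bits takes n // 8 = 1250 bytes.
print(len_constant, len_irregular, n // 8)
```

The all-zero string shrinks to a handful of bytes, while the pseudo-random one stays close to its raw length, even though both have probability $2^{-10000}$ under independent fair bits.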


Though both $K(x)$ and $K(g_\theta)$ are not generally computable by definition⁹
[Solomonoff, 1964], we will use the upper bound in (5.9) as a maximum likelihood
estimate (2.34) of $P(x)$. Therefore, our problem of finding a consistent
statistic approximately coincides with the one of reading the upper bound on

⁸ where log denotes the logarithm to base 2.

⁹ This negative result is a variant of the well-known Turing Machine halting lemma [Roger,
1967] that says there is no program $\pi$ which, on input any program $\pi'$ and related data,
computes 1 if $\pi'$ halts after any number of steps or 0 if it gets stuck in a loop.


[Margin note: nor too short.]




the probability of a sample and identifying the function of the sampled data on
which the upper bound depends. Disregarding for a moment the term $2^{K(g_\theta)}$,
we isolate within the other upper bound factor the part we assume to be independent
of the unknown aspects of the population (synthesized by parameter $\theta$)
from the part depending on them in terms of its distribution law (the wanted
statistic). Of course, samples with this same statistic have the same probability,
apart from coefficients independent of $\theta$. Hence this statistic proves sufficient.
Therefore a second approximation consists in writing the second member of
(5.9) in such a readable form. As mentioned before, we cannot write the minimal
code $\phi$ underlying $K(x)$. We can however look for very efficient programs
$\pi$ as estimates $\hat{\phi}$ of $\phi$, which we split as follows:

[Margin note: Efficient in place of optimal compression.]


$$\hat{\phi}_x = \pi_g\left(\pi_{r(x_1)}, \ldots, \pi_{r(x_m)}, \pi_{t(x_1,\ldots,x_m),\theta}\right) \qquad (5.10)$$


for suitable $g$, where $\pi_{t(x_1,\ldots,x_m),\theta}$ denotes a program that computes the sequence
$t(x_1, \ldots, x_m), \theta$ and the separator between the two values, and the allotment of
the computational tasks is aimed at minimizing the total $\pi$'s length. Namely,
we recognize in the efficient compression of property $t$ of the sample the sufficient
statistic relating with the general property $\theta$ of the whole population, while the
remnant part $r$ of $x_i$ must be computed singularly on each variable. While this is a
cognitive constraint that generally makes the program longer, this strategy is also a
useful help in devising it. We can easily recognize in the first argument of $\pi_g$ the $-\log$
of the first factor of the likelihood factorization (2.34) when a sufficient statistic
exists:

$$P(x) = f_1(x_1, \ldots, x_m)\, f_2(t(x_1, \ldots, x_m), \theta) \qquad (5.11)$$


Here we further split $f_2$ in the $\pi_{t(x_1,\ldots,x_m),\theta}$ and $K(g)$ contributions. This
allows a balancing of description complexities of statistics, constraints, and
residual unknown parts of a sample (which looks like an enlarged version of the
structural risk minimization task introduced in [Vapnik, 1995]). These are represented
by $K(t, \theta)$, $K(g)$ and $\{K(h(x_i))\}$, in an approximate version of (5.9)
that reads


$$P(x) = 2^{-\sum_{i=1}^{m} K(h(x_i)) - K(t(x_1,\ldots,x_m),\theta) - K(g)}\, 2^{K(g)} \qquad (5.12)$$


Here $2^{K(g)}$ is a sort of rewarding factor privileging the socialization attitude
of a distribution, i.e. the attitude of lowering its descriptional complexity (up
to $K(g)$) moving from a single to a set of variables.

However neither the actual $\theta$ nor the true complexity value is known. Thus,
splitting $K(t, \theta)$ in $K(t) + K(\theta)$, the maximum likelihood principle requires us to
give a very short global description of the sample by minimizing the total length
of $\pi$ as in (5.10).

[Margin note: in search for a short description independently of $\theta$.]


5.3 In search of a sufficient statistic 


As implemented in the previous section, the maximum likelihood principle
leaves us with two problems that people generally tackle in separate though
intertwined steps: optimizing the architecture of the neural network (with a
general commitment to have it be describable by a short string); and identifying
the parameters solving our learning task.


Fig. 5.6: A feed-forward neural network with one hidden layer, and parameters as in
Theorem 5.1.

For instance we may ask for a network that computes the second term of each
example when given the first term as input. Actually, there are
many theorems which ensure the network's ability to compute such a function,
whatever it is. Among others, we have the following:


Theorem 5.1. [Cybenko, 1989] Consider a feed-forward neural network with $n$
inputs, $N$ intermediate level nodes endowed with a sigmoidal activation function
$\sigma$ as in (5.6) and one linearly activated (5.3) output node like in Fig. 5.6. The
functions $h(\boldsymbol{\tau}^I)$ computed in the output


$$h(\boldsymbol{\tau}^I) = \sum_{i=1}^{N} \beta_i\, \sigma\Big(\sum_{j=1}^{n} w_{ij}\tau^I_j + \gamma_i\Big) \qquad (5.13)$$


when varying the connection weights $\beta_i$ from the intermediate to the output nodes
and the analogous weights and parameters from input to intermediate nodes, are
dense in the space $\mathcal{C}(I_n)$ of the continuous functions in the unitary hypercube
$I_n$ of dimension $n$. This means that for each continuous function $c$ on $I_n$ and
for each $\varepsilon > 0$ there exist an integer $N$ and real constants $\beta_i$, $w_{ij}$ and $\gamma_i$, such
that


$$\max_{\boldsymbol{\tau} \in I_n} |c(\boldsymbol{\tau}) - h(\boldsymbol{\tau})| < \varepsilon \qquad (5.14)$$
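A hand-built network for $n = 1$ gives a feeling for the density statement. The sketch below is not Cybenko's argument nor a training algorithm: it is a simple staircase construction of ours, where each hidden sigmoid is made steep and centered at the midpoint of one of `n_hidden` equal intervals of $[0, 1]$, and the output weight of each unit is the increment of the target over that interval. The target function, the slope factor 10 and all names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def build_network(c, n_hidden):
    """Return (beta_i, w_i, gamma_i) triples of a staircase-like network on [0, 1]."""
    steep = 10.0 * n_hidden                   # slope of every hidden sigmoid
    units = [(c(0.0), 0.0, 40.0)]             # saturated unit acting as the constant c(0)
    for k in range(1, n_hidden + 1):
        left, right = (k - 1) / n_hidden, k / n_hidden
        mid = 0.5 * (left + right)
        units.append((c(right) - c(left), steep, -steep * mid))
    return units

def h(units, x):
    """Network output: sum_i beta_i * sigma(w_i * x + gamma_i), linear output node."""
    return sum(b * sigmoid(w * x + g) for b, w, g in units)

def max_error(c, units, grid=1000):
    """Largest |c(x) - h(x)| over an evenly spaced grid of [0, 1]."""
    return max(abs(c(i / grid) - h(units, i / grid)) for i in range(grid + 1))

def target(x):
    return math.sin(2.0 * math.pi * x)

err_10 = max_error(target, build_network(target, 10))
err_100 = max_error(target, build_network(target, 100))
print(err_10, err_100)
```

The error measured on the grid shrinks as hidden units are added, at the price of a longer description of the network, which is exactly the tension discussed next.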


The question is how large N must be, which reads as a syntactical companion
of the sample size identification dealt with when the concept class is fixed. Both 
the maximum likelihood principle, in the above approximation, and general 
wisdom push for an N as small as possible, maybe with neurons arranged in 
a different architecture, e.g. into more than one single intermediate layer. A 


[Margin notes: Learning: identifying the architecture and its parameters. The problem
is feasible. Pay a lot for learning to generalize well (short neural networks). Short
learning algorithm too to maximize the likelihood.]




Fig. 5.7: Overfitting in polynomial interpolation. Black and gray points: training and
test sets, respectively, each of the form $\{(x_i, y_i)\}$, where $x_i$'s are drawn uniformly in
$[0, 10]$ and the corresponding $y_i$'s are computed as $a x_i^2 + b x_i + c + e$, with $a = 0.2$,
$b = -1$, $c = 4$ and $e$ denoting a random measurement error uniformly distributed in
$[-0.5, 0.5]$. Black and gray curves illustrate respectively the polynomials of degree 10
and 2 interpolating the black points through a least-squares technique.


typical folklore goes: "the easier a curve learns (to reproduce) a training set,
the harder it generalizes on the test set". A typical drawback of a fine fitting of a
set of points with a curve is overfitting. It occurs when, like in Fig. 5.7, we lose
the main trend of the data. The curve that best fits the training set (black bullets in
the graph) is a polynomial of degree 10. This curve seems however to pursue
some details we may consider minor in the phenomenon we are studying. Hence
they may be considered as random noise. In fact the curve ranges far from new
points (marked in gray) observed from the same phenomenon (e.g. drawn with the
same distribution law) but not considered for tuning the curve (thus constituting
a test set). On the contrary the second degree curve shows almost the same
accuracy on the whole set of data. From an operational point of view we find
in the literature both algorithms that progressively enlarge the network layout
just when the actual configuration does not allow us to fit well the current point
of a training set (according to a given learning rule) [Carpenter and Grossberg,
1988], and algorithms that prune a network, once it has been satisfactorily
trained, as long as its capability of reproducing the training set remains acceptable
[Reed, 1993] ¹⁰. Many variants of these methods are available. But seldom
do they satisfactorily replace the skill of the human operator in sizing the
network well after a limited number of trials.
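The experiment of Fig. 5.7 can be reproduced in a few lines under assumptions of ours. The sample size is not given in the caption: we take 11 training points, so that the least-squares polynomial of degree 10 coincides with the interpolating polynomial (evaluated here in Lagrange form). The seed and all function names are hypothetical; the data model $0.2x^2 - x + 4 + e$ is the one of the caption.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_quadratic(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of c0 + c1 x + c2 x^2 via normal equations."""
    a = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    return solve(a, b)

def interpolant(xs, ys, x):
    """Lagrange form of the polynomial of degree len(xs) - 1 through all the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(rng, size):
    xs = [rng.uniform(0.0, 10.0) for _ in range(size)]
    ys = [0.2 * x * x - x + 4.0 + rng.uniform(-0.5, 0.5) for x in xs]
    return xs, ys

rng = random.Random(1)
x_train, y_train = sample(rng, 11)   # 11 points: the degree-10 fit interpolates them
x_test, y_test = sample(rng, 11)
coef = fit_quadratic(x_train, y_train)
deg2 = lambda x: coef[0] + coef[1] * x + coef[2] * x * x
deg10 = lambda x: interpolant(x_train, y_train, x)
print("train:", mse(deg2, x_train, y_train), mse(deg10, x_train, y_train))
print("test: ", mse(deg2, x_test, y_test), mse(deg10, x_test, y_test))
```

The degree-10 curve has null error on the training points by construction, whereas the degree-2 curve cannot do better there than the noise allows; comparing the two test errors printed above shows how each curve behaves on points it has not seen.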

The second horn of the problem comes to a search for statistics driving 
the inference of the parameters of this network. Still, inequality (5.9) offers a 
methodological suggestion if we consider the probability of the future output 
given the network layout and the training set. In this case the description of the 
future data passes through the description of the learning algorithm. Therefore 


¹⁰ Pruning strategies are generally based either on the sensitivity of the risk function to
the single weights or on special terms added to it, penalizing the complexity of the network
[Hanson and Pratt, 1989].




we want the latter to be simple and short, so as to maximize the upper bound 
to the probability of meeting it in the future. This requirement complies with 
the Occam razor principle ¹¹ and with the general aim of imitating the learning
procedure of the human brain (which we assume to be simple in its building 
blocks, so simple that even an infant can implement it). Finally, the concept 
class C we want to learn is syntactically well defined as a family of neural 
networks with a set of free parameters, yet semantically unknown. Thus we 
have no tool for describing the distribution law of its elements that explain 
suffixes of the observed sample. So, we may consider the distribution law of w 
unattainable, and focus directly on its point estimation, hence of a hypothesis 
h, by minimizing some simple risk functions. 
The most useful one is still the MSE, which in this case reads: 


$$E\left[(T^O - h(T^I))^{\top} (T^O - h(T^I))\right] \qquad (5.15)$$


where $T^O$ denotes the target output we should compute, $h(T^I)$ the output we
actually compute as a function of the state sub-vector we manage as input,
and $a^{\top}$ denotes the transpose of vector $a$. As mentioned in Example 5.1, this
quantity depends on the random suffix distribution law in two respects: for any
suffix specification we take the expectation of the quadratic difference between
$T^O$ and $h(T^I)$, where $h$ depends on an observed sample. This expectation is
still a random variable; hence we take the expectation of this variable over the
possible suffix specifications.

If $T^O - c(T^I)$ is a Gaussian variable denoting a random noise according to
common regression paradigms, we may typically refer to the statistic $s$, defined
as:


$$s = \frac{1}{m}\sum_{j=1}^{m} (\boldsymbol{\tau}^O_j - h(\boldsymbol{\tau}^I_j))^{\top} (\boldsymbol{\tau}^O_j - h(\boldsymbol{\tau}^I_j)) = \frac{1}{m}\sum_{j=1}^{m}\sum_{k=1}^{\nu_O} (\tau^O_{jk} - h(\boldsymbol{\tau}^I_j)_k)^2 \qquad (5.16)$$


where $h$ is the current hypothesis for $c$ and $\nu_O$ is the number of output PE's,
while $\tau^O_{jk}$ and $h(\boldsymbol{\tau}^I_j)_k$ denote the $k$-th component of $\boldsymbol{\tau}^O_j$ and $h(\boldsymbol{\tau}^I_j)$. Actually,
most statistical studies on the performance of a neural network's learning
algorithms are devoted to justifying how this statistic works well under a set
of boundary conditions so special as to be inapplicable. In some sense $s$ plays
in learning the role of general purpose statistic that $\bar{x}$ played for many years
in parametric inference. The latter case was due mostly to the absence of an
efficient computing tool for performing more complex computations. The former
is due to the lack of available information for reasoning about the MSE
distribution law. Generally we try to enrich the risk function by adding some
further constraint on the network parameters.


Example 5.3. To prevent overfitting, we may ask for a bound on the square
norm $\|w\|^2 = \sum_{i,j=1}^{\nu} w_{ij}^2$ of the weight vector that acts as a regularization

¹¹ In its original formulation, this principle states that "Nunquam ponenda est pluralitas
sine necessitate", which, approximately translated, means "Entities should not be multiplied
beyond necessity" [Tornay, 1938].


[Margin notes: A general purpose cost function to address the computation. Sample
square error is the learning companion of the sample mean in point estimation. Binding
the connection weights is a way of minimizing the structural risk or equivalently
maximizing their conditional probability. A team of advisor networks. If the target is
a distribution law a closeness measure of a hypothesis is the relative entropy.]



term [Poggio and Girosi, 1990b]. Its role is to smooth the function computed
by the network. The cost lies in not hitting the training set points perfectly,
the expected benefit in improving generalization. Thus, with $\lambda$ as a balancing
factor, the overall risk function reads


$$E\left[(T^O - h(T^I))^{\top} (T^O - h(T^I))\right] + \lambda \|w\|^2 \qquad (5.17)$$


whose minimization is an instance of the structural risk minimization problem. Note
that we can arrive at the same risk function i) by starting from a Bayesian approach,
where, with reference to the Bayes' formula in Sec. 1.3.3, we want to
maximize the probability $P(W \mid \boldsymbol{\tau})$ of the free parameters specification given the
sampled training set, ii) under Gaussianity hypotheses on the a priori distribution
of $W$ and a posteriori distribution of $T$ given $w$ [MacKay, 2003]. In any
case the statistic on which the learning algorithm is based remains $S$.
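The trade-off ruled by the balancing factor can be seen in the simplest possible setting. The sketch below is ours and deliberately minimal: a single linear neuron $h(x) = wx$ with one weight and no threshold, the sample square error in place of the expectation in the risk, and made-up data. For this case the minimizer of the regularized risk has the closed form $w = \sum_j x_j y_j / (\sum_j x_j^2 + m\lambda)$, obtained by setting the derivative with respect to $w$ to zero.

```python
def sample_square_error(w, xs, ys):
    """Sample square error of the one-weight linear neuron h(x) = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def regularized_risk(w, xs, ys, lam):
    """Empirical counterpart of the regularized risk: square error + lam * w^2."""
    return sample_square_error(w, xs, ys) + lam * w * w

def minimizer(xs, ys, lam):
    """Closed-form minimizer of the regularized risk for this one-weight case."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + m * lam)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]          # made-up data, roughly y = 2x
w_plain = minimizer(xs, ys, lam=0.0)
w_reg = minimizer(xs, ys, lam=0.5)
print(w_plain, sample_square_error(w_plain, xs, ys))
print(w_reg, sample_square_error(w_reg, xs, ys))
```

With $\lambda > 0$ the weight shrinks toward 0 and the training points are hit less precisely, which is the cost mentioned in the example; whether generalization improves depends on the data.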


If $T^O$ is a binary vector, then MSE translates into the frequency of mislabeled
examples at the basis of the learning theory for Boolean concepts.

A similar statistic is also at the basis of learning algorithms for the ensemble 
of networks [Jacobs et al., 1991]. 


Example 5.4. The idea underlying ensembles of networks is that $T^O$ is taken
as the output of a network which evaluates the analogous values produced by an
ensemble of neural networks acting as advisors for the former. In this case the
statistic computes the frequency with which each advisor produced a correct
output.


The situation is different when we work with stochastic neural networks. Here
we assume that $c$ also produces a stochastic output, i.e. we have that, for given
$c$, $c(\boldsymbol{\tau}^I)$ is a random variable $T^O$. Both $\boldsymbol{\tau}^I$ and $\boldsymbol{\tau}^O$ are Boolean vectors and the
randomness of $T^O$ cannot be disregarded as a marginal noise; rather, the target of
the learning task consists precisely in the parameters of the joint distribution law
of its bits. In this case we usually assume as risk function the Kullback-Leibler
relative entropy (Kullback distance for short) $I(w)$ between the environment
distribution law $\varphi(\boldsymbol{\tau}^V)$ and the distribution $\pi_\theta(\boldsymbol{\tau}^V) = \pi_\theta(\boldsymbol{\tau}^V; w)$ produced by
the network on the visible neurons. Its expression is the following:


$$I(w) = \sum_{\boldsymbol{\tau}^V \in \mathfrak{X}^V} \varphi(\boldsymbol{\tau}^V) \ln \frac{\varphi(\boldsymbol{\tau}^V)}{\pi_\theta(\boldsymbol{\tau}^V; w)} \qquad (5.18)$$

where $\mathfrak{X}^V$ is the set of the visible nodes' state vectors. If we refer to a stochastic
network with symmetric connections and random activation mode, and identify
$\pi_\theta(\boldsymbol{\tau}^V)$ with the stationary distribution of the Markov chain implemented by
the network as discussed in Sec. 5.1, the goal of the learning procedure is to



identify a suitable temperature and intermediate belt structure (see Fig. 5.5)
and parameters, where the border line between thermal and intermediate belts
might drift during the training as a consequence of the weights tuning. Following
the plug-in method in Sec. 2.3.2.3, the maximum likelihood estimate $\hat{I}(w)$ of
(5.18) is given by:


(5.19) 
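As a numerical illustration of a plug-in estimate of the Kullback distance, the sketch below replaces the environment distribution with the relative frequencies observed in a sample of visible states. This reading of the plug-in step is an assumption of ours, and both the sample and the distribution attributed to the network are made up.

```python
import math
from collections import Counter

def kullback(p, q):
    """Sum over v of p(v) * ln(p(v) / q(v)), restricted to states with p(v) > 0."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)

def empirical(sample):
    """Relative frequencies of the observed visible states."""
    counts = Counter(sample)
    return {v: c / len(sample) for v, c in counts.items()}

# Made-up sample of visible states (two visible neurons) and made-up
# distribution produced by the network on the same states.
sample = [(0, 0)] * 5 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 1
network = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
freq = empirical(sample)
estimate = kullback(freq, network)
print(estimate)
```

The estimate is null when the network reproduces the observed frequencies exactly and positive otherwise, so lowering it pushes the network distribution toward the sample.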


We cannot state that this estimator is a sufficient statistic for any T distri- 
bution law. Two facts however work in favor of its use for pivoting the following 
twisting argument: 


(iw) < T(w)) = (Iw) < 0) (5.20) 


Indeed: 


1. T(w) = s is defined for all w on whatever T”, thus avoiding jumps to 0 of 
the probability of points belonging to a same T contour with w, considered 
in an analogous case of a variable with a fixed range in Example 2.8. 

2. reg ya(T’;w) In ant is a Lyapunov function w.r.t. the Markov 
chain ruled by the transition probabilities induced by (5.5) ¹², thus ensuring
the convergence of $\pi_\theta(\boldsymbol{\tau}^V; w)$ toward its stationary distribution
$p_\theta(\boldsymbol{\tau}^V; w)$. This makes $\hat{I}(w)$ a consistent estimator of $I(w)$.


5.4 Learning strategies 


Once a pivot of the twisting argument (5.1) such as s in (5.16) or the estimate of $I(w)$ in (5.19)
has been determined, we must push it toward a minimum in the hope that the
connected cost function does the same. This represents an optimization task
falling in the class of NP-hard problems (see Appendix D). However, our brain
circuits seem used to getting approximate solutions of these targets every day¹³.
We aim to reproduce this ability on silicon through algorithms relying more on
our experience and intuition (we call them heuristics) than on theoretical results.
In the rest of this chapter we will not span the vast literature and general practice
regarding this matter; rather we will try to capture some relevant statistical
aspects that will help us to select the right heuristic for the special learning
task we are dealing with. Let us start with nearly the only strong theorem on
the matter that ensures the attainability of the learning goal with a very simple
neural network constituted by a sole PE.

12 i.e. a function of w which assumes a set of values describing a monotone trajectory with
the network evolution.
13 Any time we fit a bag we try to optimally solve a Knapsack problem [Martello and Toth,
1979] that is well known to be NP-hard.


Its maximum 
likelihood 
estimator 


is the pivot of a 
twisting 
argument 


defined for all w
and


a consistent 
estimator. 


Cost
minimization 
strategies 


The simplest 
neural network 


a single neuron 
implementing a 
linear threshold 

i.e. 


a separator 
hyperplane. 


If the input is too 
low then increase 
the weights 




5.4.1 The perceptron theorem 


Let us consider the neural network in Fig. 5.8. It consists of a single computing
neuron that is fed by a set of $\nu_i$ binary signals piped by connections from input
neurons (whose role is just to send signals), and outputs either 0 or 1 according
to the Heaviside function (5.4).

This network, which we call perceptron, computes an indicator function labeling
with 1 the points $x = \tau^i$ having a positive distance $h(x)$ from the hyperplane
described by a constant $\gamma_0$ and coefficients $w_{0j}$ (0 being the index of
the output neuron), which we call positive points, and with 0 those having
negative distance, the negative points (see Fig. 5.9). Indeed

$$h(x) = \sum_{j=1}^{\nu_i} w_{0j} \tau_j + \gamma_0 \qquad (5.21)$$


The following theorem states that we can easily adapt the perceptron's free
parameters to divide the unitary hypercube according to any hyperplane. More
precisely, identifying concepts and hypotheses with both the indicator functions
and the underlying hyperplanes, if a hyperplane c divides a set of points x into
positive points (those belonging to a same hypercube partition determined by
it) and negative points (the remaining ones) but we have no record of it apart
from the labels $\sigma(c(x))$ of the points, with little computational effort we can
build another hyperplane h equivalent to the former. By equivalent, we mean it
produces the same division of points. All we must do is run the PerceLearning
Algorithm 5.1.


Algorithm 5.1 The PerceLearning algorithm.
PerceLearning(S = {(x_1, σ(c(x_1))), ..., (x_n, σ(c(x_n))), ...})
{
    fix w ∈ R^{ν_i+1};
    set h(x) = Σ_{j=1}^{ν_i+1} w_j x_j;
    i = 1;
    while (error > 0)
    {
        randomly pick (x_i, σ(c(x_i))) ∈ S;
        if σ(c(x_i)) = 1 and σ(h(x_i)) = 0
            w = w + x_i;
        if σ(c(x_i)) = 0 and σ(h(x_i)) = 1
            w = w - x_i;
        i = i + 1;
    }
}


Namely, for whatever starting values we give the parameters, we must change
these values according to the following simple rule: pick a point x in the sample
and compute a label $\sigma(h(x))$ for it according to the current values of the free




Fig. 5.8: Architecture of a perceptron having inputs $x_1, \dots, x_{\nu_i}, x_{\nu_i+1}$ (the latter
behaving as a threshold for the output unit), activation function $\sigma$ as in (5.4) and $a = h$
as in (5.21).


Fig. 5.9: Partition of the (projected) vertices of a hypercube on $[0,1]^{\nu_i}$ in the two
half-spaces generated by a separating hyperplane. Positive and negative points are
marked, respectively, with + and -.


If the input is too
high then


decrease the 
weights. 


With this rule we 
learn a 
perceptron in a 
limited number of 
weight updates 


Mislabeling 
frequency is a 
sufficient statistic 


Consistency 
problem for a 
perceptron is in 
class P 


We need a 
polynomial 
sample to get a 
sufficient number 
of updates 




parameters, where $\sigma$ is computed according to (5.4). If it coincides with the
original label $\sigma(c(x))$ then do nothing. Otherwise, if $\sigma(c(x)) = 1$ and $\sigma(h(x)) = 0$
(denoting too low an $h(x)$) increase w, vice versa decrease it. If for the
sake of uniformity we include $\gamma_0$ in the vector w (by adding a dummy input
neuron piping a signal equal to 1 through a connection with weight exactly equal
to $\gamma_0$ according to Sec. 5.1), we obtain these changes on w simply by adding
or subtracting the augmented input $(x, 1)$ to/from the augmented parameter
$(w_0, \gamma_0)$.
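
The update rule just described can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the threshold function returns 1 for strictly positive net input (the exact convention of (5.4) is not restated here), the weights start from zero, and the sample is the AND function on the vertices of the unit square, chosen because it is linearly separable. The names `perce_learning` and `and_sample` are hypothetical.

```python
import random

def heaviside(a):
    """Threshold activation: 1 for positive net input, 0 otherwise (assumed convention)."""
    return 1 if a > 0 else 0

def net_output(w, x):
    """Label computed on the augmented input (x, 1)."""
    return heaviside(sum(wj * xj for wj, xj in zip(w, x + (1,))))

def perce_learning(sample, seed=0, max_picks=100_000):
    """Perceptron rule on augmented inputs; returns (weights, number of updates)."""
    rng = random.Random(seed)
    w = [0.0] * (len(sample[0][0]) + 1)   # last component plays the role of the threshold
    updates = 0
    for _ in range(max_picks):
        if all(net_output(w, x) == label for x, label in sample):
            break                          # no mislabeled point is left
        x, label = rng.choice(sample)
        out = net_output(w, x)
        xa = x + (1,)
        if label == 1 and out == 0:        # net input too low: add the augmented input
            w = [wj + xj for wj, xj in zip(w, xa)]
            updates += 1
        elif label == 0 and out == 1:      # net input too high: subtract it
            w = [wj - xj for wj, xj in zip(w, xa)]
            updates += 1
    return w, updates

# Vertices of the unit square labeled by the AND function (linearly separable).
and_sample = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, n_updates = perce_learning(and_sample)
predictions = [net_output(weights, x) for x, _ in and_sample]
print(weights, n_updates, predictions)
```

Only mislabeled points change the weights, so the number of updates can be much smaller than the number of examples inspected.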


Theorem 5.2. [Minsky and Papert, 1988]


• For any hyperplane described by the equation $c(x) = \sum_{j=1}^{\nu_i} b_j x_j + b_0 = 0$
dividing the discrete $\nu_i$-dimensional hypercube $\mathfrak{X}$ into the half-spaces¹⁴:

$$B^0 = \{x \text{ such that } c(x) < 0\} \qquad (5.22)$$
$$B^1 = \{x \text{ such that } c(x) > 0\} \qquad (5.23)$$


• if you pick points either from $B^0$ or $B^1$ in any proportion,


• the learning task: compute $w_0$ and $\gamma_0$ such that $h(x) = \sum_{j=1}^{\nu_i} w_{0j} x_j + \gamma_0$
and $\sigma(h(x)) = \sigma(c(x))\ \forall x \in \mathfrak{X}$, is solved by the PerceLearning Algorithm
5.1 after a number of weight updates fewer than $\frac{1}{\mu(\nu_i)^2}$, where $\mu(\nu_i)$ is
the hyperplane optimal margin (see Sec. 3.6), i.e. the minimal distance
between the points in $\mathfrak{X}$ and the separator hyperplane which maximizes
this distance.


Based mainly on the discreteness of the separated points, hence on the approximation
with which the hyperplane coefficients must be identified (see Fig.
5.10)¹⁵, the proof of the theorem is simple enough. The reader may find it in any
elementary book on neural networks. Its statement, which fits with other results
we have already seen, says that the frequency of mislabeled points is a sufficient
statistic. We have already found this in Sec. 3.1.5. In Sec. 3.2 we remarked
that computing a Boolean hypothesis consistent with a sample is an NP-hard
problem. According to Theorem 5.2, however, the CONSISTENCY problem
restricted to the class of hyperplanes admits solution in polynomial time, since
$1/\mu$, as a measure of how points of the unitary hypercube crowd together when
they are projected on a section perpendicular to the hyperplane, is polynomial in $\nu_i$.
We saw a similar result in our discussion of support vector machines. Actually,
the running time seems independent of the accuracy parameters $\epsilon$ and $\delta$ with
which we want to learn. Still worse, the algorithm learns with $\epsilon$ and $\delta$ equal
to 0. This apparent discrepancy with the mentioned result vanishes however
when we consider that the upper bound in Theorem 5.2 concerns the number
of updates and not the number of examples. Thus to come to the algorithm


14 Working with the hypercube vertices, hence with discrete state vectors, hence with discrete
weights as well (see the subsequent footnote), we may always move from a hyperplane crossing
some vertices to another one not containing any vertex, yet with the same sets $B^0$ and $B^1$. With
the latter we have only hard thresholds, in the sense that c(x) is either > 0 or < 0.

15 As a general result we need at most $n^2 \log(n)$ bits to describe with sufficient accuracy a
network of n binary state neurons.




Fig. 5.10: A set of different hyperplane projections (thin lines) performing the same 
separation done by hyperplane in Fig. 5.9. 


complexity in the usual terms, we must consider how many examples we must
observe before meeting a polynomial number of updates. From Example 3.2,
the detail of a hyperplane passing through the origin of the cartesian axes of a
$\nu_i$-dimensional space is $\nu_i$. So from Corollary 3.3 we deduce that this number
is polynomial in $\nu_i$, $1/\epsilon$ and $1/\delta$.


5.4.2 Minimization methods 


Since minimizing the frequency of mislabeled points represents the Boolean ver- 
sion of the task of minimizing s as in (5.16), the strong result of the above 
theorem validates this pivot for twisting the generalization capability. We could
denote the PerceLearning algorithm as an efficient way of both incrementally
computing the statistic and finding its descent direction in the parameter
space¹⁶. As the complexity of the learning task increases, these two tasks are
dealt with in many ways. In this regard, the PerceLearning algorithm falls in 
the family of punishment training procedures: we modify the behavior of our 
network since it failed in computing an output. Quite a different strategy is 
based on rewards: we reward our network each time it produces an output we 
expect. This second strategy requires the network to generate a correct out- 
put by itself, without teacher intervention. This intervention occurs only once 
the network has produced it. This may happen when the network is activated 
randomly in such a way that the full variety of output is produced with some 
probability. We work to render high the probability of correct outputs, for 
instance by minimizing a Kullback distance as in (5.18). 

Reward and punishment are ways of finding the descent direction of the cost
function we want to minimize. A suitable tuning of these interventions makes
the algorithm efficient. However, since we assume we know almost nothing about
the landscape of the cost function, we are drawn to use very rough tuning algorithms.
Their great value lies in their ease of implementation. We go from


16thus representing an incremental way of implementing a support vector machine. 


Either punish the 
network when it 
is wrong 


or reward it when 
it is correct. 


In any case with 
rough methods 


possibly affecting
one parameter at
a time.


A random walk in 
the parameter 
space 


Each candidate 
step is accepted 
with a probability 
depending on its 
local benefit 


The walk 
converges around 
minimal cost sites 
with a stationary 

distribution 




reinforcement learning [Sutton and Barto, 1998] algorithms, where both the correctness
evaluation of $\tau^o$ and the reward/punishment action entail heuristics
within various disciplines linked to the learning task, to general purpose algorithms
which act independently on one parameter at a time. We skip the former
since they are beyond the scope of this book. Instead, still trying to focus on
their statistical relevance, we briefly discuss three of the latter strategies.


5.4.2.1 The Metropolis search 


This is a typical stochastic optimization algorithm. The main difference in 
comparison to inference is that now we generate by ourselves the sample from 
which to take statistics. We have already seen in Sec. 1.3.6 how we can generate 
a set of independent random variables specifications. Here we look for a set 
of dependent variables (each constituted by the coordinates of a point in the 
parameters space), with the aim of improving the probability that the point 
comes close to a solution of our cost minimization problem. Thus we generate a 
history of candidate solutions from which to pick the one with the lowest cost. In 
this framework a Metropolis algorithm [Aarts and Korst, 1989] is a Markov chain
on the set $\{w_i\}$ of the possible assignments to the runtime parameter vector $w(t)$
inducing the mentioned dependence between the visited points during the trip
towards the minimum¹⁷. We simulate it through a series of random changes
on $w(t)$ paired with a series of Bernoulli variables determining whether the
computed changes must be accepted or rejected. The related pseudocode is in
Algorithm 5.2. The parameters vector is solicited by a limited change of its
components, so that $w(t)$ randomly moves from a point $w_i$ to another $w_j$ in its
neighborhood Neig($w_i$) in the parameter space. The probability of accepting
a change is a function $g$ of the difference $\Delta = r(w_j) - r(w_i)$ between the risk
function specifications in correspondence to the candidate (after the change)
and current parameters vectors. Two typical $g$'s are:

$$g(\Delta) = \begin{cases} 1 & \text{if } \Delta < 0 \\ e^{-\beta\Delta} & \text{otherwise} \end{cases} \qquad (5.24)$$

$$g(\Delta) = \frac{1}{1 + e^{\beta\Delta}} \qquad (5.25)$$


As usual in incremental methods, we have no a priori rule for halting the 
algorithm. Rather we must fix a stopping rule by ourselves, e.g. based on some 
frustration phenomenon, such as the fact that r remains constant for a long 
number of iterations. 
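
The accept/reject walk described above can be sketched as follows. This is a minimal sketch, not the book's code: it assumes the usual Metropolis and logistic forms for the acceptance function g with a positive parameter `beta`, uses a fixed number of moves as stopping rule, and minimizes a hypothetical toy cost on the integers with neighbors w - 1 and w + 1.

```python
import math
import random

def g_metropolis(delta, beta):
    """Always accept a descent; accept an ascent with probability exp(-beta * delta)."""
    return 1.0 if delta < 0 else math.exp(-beta * delta)

def g_logistic(delta, beta):
    """Smooth alternative: acceptance probability 1 / (1 + exp(beta * delta))."""
    return 1.0 / (1.0 + math.exp(beta * delta))

def local_stochastic_optimization(r, w0, neighbors, g, beta, steps, seed=0):
    """Random walk with accept/reject moves; returns the lowest-cost point visited."""
    rng = random.Random(seed)
    w, best = w0, w0
    for _ in range(steps):                       # stopping rule: fixed number of moves
        candidate = rng.choice(neighbors(w))     # perturb
        if rng.random() < g(r(candidate) - r(w), beta):  # Bernoulli acceptance
            w = candidate                        # update
        if r(w) < r(best):
            best = w                             # keep the best point of the history
    return best

# Hypothetical toy cost on the integers with its minimum at w = 7.
def cost(w):
    return (w - 7) ** 2

best_w = local_stochastic_optimization(
    cost, 0, lambda w: [w - 1, w + 1], g_metropolis, beta=1.0, steps=2000)
print(best_w, cost(best_w))
```

Keeping track of the best visited point reflects the idea of generating a history of candidate solutions from which to pick the one with the lowest cost.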

This algorithm has two main benefits: 


• It moves the distribution law $\pi_\beta$ of w, where $\beta \in \mathbb{R}^+$, toward a stationary
distribution $\phi_\beta$ having the highest probability masses in points where the

17 Actually, in derogating from definitions in Appendix C, the sites visited by this process 
may constitute a continuous set. 




Algorithm 5.2 LocalStochasticOptimization (L.S.O.) algorithm.

LocalStochasticOptimization(w, r, g)
{
    Define Neig(w) = the elements' set neighboring w;
    Fix w(0);
    t = 0;
    Repeat {
        t = t + 1;
        Perturb:
            w_j = random(Neig(w(t)));
        Accept:
            P(Accept) = g(r(w_j) - r(w(t)))
        If (Accept = TRUE) then
            Update:
                w(t) = w_j
    } Until Stopping rule = TRUE;
}


cost function $r(w)$ is minimal, indeed

$$\phi_\beta(w) = \frac{e^{-\beta r(w)}}{z_\beta} \qquad (5.26)$$

where $z_\beta$ is a normalization constant. Furthermore, it does so monotonically,
in the sense that the relative entropy

$$I(\phi_\beta, \pi_\beta) = \sum_{w \in D_w} \phi_\beta(w) \ln \frac{\phi_\beta(w)}{\pi_\beta(w)} \qquad (5.27)$$

never increases along the moves' history (see footnote 12 for an analogous
property).


• It pushes the network toward the stationary distribution according to a
neighborhood definition that is up to us, for instance changing one weight
at a time, a facility that lowers the move computational load.
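
The two benefits can be checked numerically on a small state space. The sketch below is an illustration under stated assumptions: five sites on a ring with hypothetical costs, a proposal that picks one of the two ring neighbors with probability 1/2, and the Metropolis acceptance rule with a fixed `beta`. It builds the exact transition matrix, iterates the distribution of w starting from the uniform law, and records the relative entropy with respect to the exponential stationary law.

```python
import math

beta = 1.5
costs = [3.0, 1.0, 0.0, 2.0, 4.0]   # hypothetical cost r(w) on 5 sites of a ring
n = len(costs)

def g(delta):
    """Metropolis acceptance probability at the fixed value of beta."""
    return 1.0 if delta < 0 else math.exp(-beta * delta)

# Transition matrix: propose a ring neighbor with probability 1/2, then accept or stay.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in ((i - 1) % n, (i + 1) % n):
        P[i][j] = 0.5 * g(costs[j] - costs[i])
    P[i][i] = 1.0 - sum(P[i])        # rejected proposals leave the state unchanged

z = sum(math.exp(-beta * c) for c in costs)       # normalization constant
phi = [math.exp(-beta * c) / z for c in costs]    # candidate stationary distribution

def relative_entropy(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

pi_t = [1.0 / n] * n                               # start from the uniform law
history = [relative_entropy(phi, pi_t)]
for _ in range(500):
    pi_t = [sum(pi_t[i] * P[i][j] for i in range(n)) for j in range(n)]
    history.append(relative_entropy(phi, pi_t))

print(history[0], history[-1])
```

In this toy chain the recorded relative entropy never increases and the distribution of w settles on the exponential law, with the largest mass on the site of minimal cost.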


The shape of $\phi_\beta(w)$ suggests that the probability preponderance of minimal
cost points increases with $\beta$ (i.e. the greater $\beta$ the higher the probability
preponderance). But high values of this parameter render the moves toward more
costly points practically impossible. As a consequence, w gets stuck in a local
minimum¹⁸. To both speed up the convergence and bypass the mentioned
drawback, a proper management of this parameter is dealt with in the family of Simulated


18This is the typical plague of greedy algorithms, which at each move uniquely choose a 
descent direction; in the next section we will discuss this problem in greater detail. 


A suitable 
schedule of the 
process 
temperature 
speeds up the 
convergence 


Learning an XOR 


Steepest descent 




Table 5.1: The XOR truth table. 


Fig. 5.11: We cannot separate the XOR function when using lines as decision surfaces. 
Black and gray bullets denote, respectively, positive and negative points. It is evident 
from the figure that none of the four lines which correctly separates the points of one 
class can do the same with the others. 


Annealing algorithms [Aarts and Korst, 1989]. If we maintain the evocative
metaphor that $\theta = 1/\beta$ is the temperature of a thermodynamical process giving
motion to w, a Simulated Annealing algorithm is a specification of the number
of w moves at a given temperature and of the rate at which the temperature decreases toward 0
once these moves have been accomplished. This we call the cooling schedule.


Example 5.5. Assume you want to learn the Boolean function underlying the 
truth table described in Table 5.1. Actually we already know that this table 
describes the XOR function 


$$\mathrm{XOR}(v_1, v_2) = (v_1 \wedge \bar{v}_2) \vee (\bar{v}_1 \wedge v_2) \qquad (5.28)$$


This function cannot be computed by a single perceptron, since the positive
points cannot be discriminated through a line in $\mathbb{R}^2$ (see Fig. 5.11). Nevertheless
a multilayer perceptron with state in $[0, 1]^{\nu}$ can do it in a few learning iterations,
provided you interpret as 1 any outcome of the output node greater than 0.5,
0 otherwise. We opt for a 2-2-1 three-layer perceptron, with two input nodes and one
output node dictated by the function's input and output, and two hidden nodes to get




Fig. 5.12: Course of the error s versus learning cycles when learning the XOR function 
through Metropolis algorithm. 


a minimal structure able to compute the XOR function. Each neuron is ruled
by a sigmoidal activation function, because we want a continuous dependence
of the output on the parameters. Finally we opted for the risk function (5.15),
hence the statistic $s = \frac{1}{m} \sum_{i=1}^{m} (x_{5i} - h(x_{1i}, x_{2i}))^2$. Here m is the number of
examples that can in principle be any one (as we can submit each example more
than once), but is actually 4 if we want to compute the MSE after having fed
the network with a balanced training set¹⁹. To minimize this function, we act
on the parameter vector w made of 9 components (4 connection weights from
input to hidden nodes plus 2 weights from hidden to output and 1 threshold for
each of the 3 non input nodes). We drive w in the parameter space through
elementary variations $\delta w$ that are accepted according to the Metropolis rule
(5.24) with a suitable temperature schedule. Figure 5.12 shows the course of s
with the number of moves [Amato et al., 1991]. The residual error denotes that
in all 4 instances the network computes correctly $x_5$ in the sense that it outputs
a value very close to 1 when $x_5 = 1$ and to 0 otherwise. The risk function
$E|X_5 - h(X_1, X_2)|$ may assume slightly different values, depending on how the
input distribution law is unbalanced in relation to the training set composition.

of a quadratic
function of
mistakes

through randomly
... walks.
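
To make the 9-parameter structure concrete, the following sketch evaluates a 2-2-1 sigmoidal network on the four XOR examples and computes the mean squared error statistic. It is only an illustration: the weights are hand-picked rather than learned by the Metropolis search, and the node numbering (3 and 4 hidden, 5 output) and the names `network`, `statistic_s` and `w_hand` are assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def network(x1, x2, w):
    """2-2-1 perceptron: 4 input-to-hidden weights, 2 hidden-to-output weights, 3 thresholds."""
    w13, w23, w14, w24, w35, w45, g3, g4, g5 = w
    x3 = sigmoid(w13 * x1 + w23 * x2 + g3)     # hidden node 3
    x4 = sigmoid(w14 * x1 + w24 * x2 + g4)     # hidden node 4
    return sigmoid(w35 * x3 + w45 * x4 + g5)   # output node 5

def statistic_s(w, examples):
    """Mean squared error between targets and network outputs."""
    return sum((t - network(x1, x2, w)) ** 2 for x1, x2, t in examples) / len(examples)

# Balanced training set covering the whole XOR truth table (m = 4).
xor_examples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# Hand-picked weights: node 3 behaves like OR, node 4 like AND, node 5 like "3 and not 4".
w_hand = (20.0, 20.0, 20.0, 20.0, 20.0, -20.0, -10.0, -30.0, -10.0)

labels = [1 if network(x1, x2, w_hand) > 0.5 else 0 for x1, x2, _ in xor_examples]
s_hand = statistic_s(w_hand, xor_examples)
print(labels, s_hand)
```

A learning algorithm has to find a point of this kind in the 9-dimensional parameter space without knowing it in advance; the residual value of the statistic is small but not exactly 0 because sigmoidal outputs never reach 0 or 1.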


Graphs like the one in Fig. 5.12, which report the behavior of some performance indices
with the progress of the training algorithm, will use different variables in abscissa
to count the algorithm iterations. It depends partly on the algorithm
peculiarities, but very often on the experimenter preferences. When not differently
specified, we will count these iterations in number of updates, i.e. steps,
if they increment after the processing of a single example of the training set.
Otherwise we will refer to learning cycles, each grouping a number of contiguous
steps expressly specified.

The clock of the
learning process

19 This is a specially lucky case where we can cover with the training set the whole population
of input-output pairs balancing the two kinds of labels. See later on for what concerns the
packing of the training set and the cadence of the weights updating.


5.4.2.2 Gradient descent in a deterministic framework 


If the cost is a continuous function $r(w)$, then we know its minima
lie in points where the first derivative with respect to every $w_{ij}$ is 0. This
occurs for instance in a fully connected network of neurons activated as in (5.6).
Then, like in many NP-hard search problems, the algorithm for finding an exact
solution is well defined, but its implementation might prove computationally
unfeasible. In our example even the set of $\nu^2$ equations

$$\begin{cases} \dfrac{\partial r(w)}{\partial w_{11}} = 0 \\ \quad \vdots \\ \dfrac{\partial r(w)}{\partial w_{\nu\nu}} = 0 \end{cases} \qquad (5.29)$$

is generally hard to solve analytically. Moreover, to distinguish minima from
maxima, more complex conditions must be checked on the second derivatives.
Therefore we generally look for incremental methods, where we move along
directions in the parameter space where the derivatives of r do not increase,
thus denoting a local descent of it.

A deterministic
... parameter space

that, thanks to
the continuity of
the cost function,
we address along
the steep descent
of its landscape.

This strategy has many drawbacks. The first is represented by the local
minima traps. From Fig. 5.13 we see that once the function comes close to
a minimum like A, any moderate displacement of it gets reabsorbed by the
subsequent move toward the minimum again. Another is that you have no
guarantee in general about the running time even to reach so poor a minimum.
All depends on the length of the step you take in the descent direction: if
it is too long, it could overshoot the minimum; if too short, it could require an
unsustainable time to get close to A. And then you can invent a lot of degenerate
minimum situations where the point in the parameter space runs infinitely along
a loop of two or more points, etc.
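
The role of the step length can be seen on a deliberately simple case. The sketch below is a toy illustration, assuming the one-dimensional cost r(w) = w², whose derivative is 2w, and a fixed step; neither the cost nor the function name `descend` comes from the book.

```python
def descend(step, w0=1.0, iterations=50):
    """Fixed-step gradient descent on the toy cost r(w) = w**2 (gradient 2*w)."""
    w = w0
    for _ in range(iterations):
        w = w - step * 2.0 * w
    return w

too_short = descend(0.001)   # after 50 moves the point has barely left the start
reasonable = descend(0.25)   # the distance from the minimum halves at every move
loop = descend(1.0)          # the point jumps between +1 and -1 without ever settling
too_long = descend(1.1)      # each move overshoots the minimum by more than the last

print(too_short, reasonable, loop, too_long)
```

The four calls reproduce, in order, the unsustainable running time of too short a step, a well tuned descent, a loop of two points, and the overshooting caused by too long a step.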
A method as
robust as rough

deserving many
assessments up
to us.

Despite these respectable considerations, the gradient descent strategy is
not only the one most commonly used for training a neural network, but the
most successful one as well. The fact is that, in the absence of exploitable
formal knowledge on the cost landscape, the more elementary the method the
more robust it is. You can avoid drawbacks with additional artifices that do
not conflict with this fact, thanks precisely to its elementariness. Thus, in a
Metropolis-like fashion, you get random shakes for escaping the local minima
on the part of the processing sequence of the training set examples and suitably
schedule the w updating, a point we will discuss later in greater detail. A
dynamic variation of the step lengths might also have the same effect. The



Fig. 5.13: A metaphor for the local descent optimization algorithm. The rolling ball
on the left would get trapped in the first (local) minimum A, rather than reach the
(global) one on the right.


tuning of both schedules and lengths is completely up to you, i.e. you are free
to design heuristics for this task basing what you do mainly on both your flair
for cooking and your knowledge about the network. For instance if you assume
that some $w_{ij}$'s are already well tuned, you might freeze them and continue
training the remaining ones.

On the basis of the above considerations, we have a huge amount of gradient 
descent-based algorithms in the literature. With reference to the architecture 
sketched in Fig. 5.3 we just cite a general way of computing the gradient of 
the cost function relative to w in the two forward and backward releases of 
propagating the sensitivity of the cost function along both the network nodes 
and the time evolution. 

Let us specify (5.7) as follows

$$\tau(t+1) = g(\tau(t), x(t); w, t) \qquad (5.30)$$

where $x$ is the external input. Let us split the state vector $\tau = (z, v)$ in two
parts: $z$ = inner states' vector, and $v$ = visible states' vector. $g$ is a nonlinear
function resulting from the composition of the PE activation functions according
to a connection mother graph. The only requirement is that it be differentiable
with respect to the $w$ components. In other words, we ask that it is possible to
chain on the graph the derivatives with respect to the free parameters from one
to any other connected PE. Data channeling with linear weighting, as in the
$a(\tau)$ computation in (5.2), ensures this property, provided that the functions
computed by PE's are differentiable with respect to both the free parameters and
the state vector²⁰. Once the error function $E(w)$ on a trajectory of $T$ clocks is
stated:

$$E(w) = e(v_1(w), \dots, v_T(w)) \qquad (5.31)$$


20In this dynamic system we may also put PE’s which compute Boolean functions, hence 
do not satisfy the mentioned requirement. One way of solving this drawback is to give the 
output of these neurons the meaning of membership degree of the alternative output in the 
fuzzy set [Zadeh and Kacprzyk, 1992] “good alternative”. Thus we go back to dealing with 
continuous variables, getting derivability of the function computed by the neuron. 


A general neural
network
computes a
recursive function


For whatever
cost function is
accumulated with
time


we suitably split
its gradient


getting either a
forward recursive
definition of it


or a backward
recursive
definition of it.


Specifying the
backward
definition to a
network of sole
subsymbolic
neurons




the computation of its gradient along the network evolution can be implemented
according to two alternatives. They consist in either forward or backward propagating
the current derivative values with time. Meaning by $Jf(\alpha)$ the Jacobian
of the function $f$ and by $\nabla f(\alpha)$ the gradient of the function $f$ with respect to
their argument $\alpha$²¹, and specifying where necessary the parameter space in the
subscript of $J$ or $\nabla$, we can write according to the derivation chain rule:

$$\nabla E(w) = \nabla e(v_1, \dots, v_T) \begin{pmatrix} Jv_1(w) \\ \vdots \\ Jv_T(w) \end{pmatrix} = \sum_{t=1}^{T} \nabla_{v_t} e(v_1, \dots, v_T)\, Jv_t(w) \qquad (5.32)$$

where

$$J\tau_t(w) = Jg(\tau_{t-1}, x_{t-1}; w, t-1) \begin{pmatrix} J\tau_{t-1}(w) \\ 0 \\ I \\ 0 \end{pmatrix} = J_{\tau_{t-1}} g(\tau_{t-1}, x_{t-1}; w, t-1)\, J\tau_{t-1}(w) + J_w g(\tau_{t-1}, x_{t-1}; w, t-1) \qquad (5.33)$$

where $I$ is the identity matrix and $Jv$ is a submatrix of $J\tau$. Equations (5.32)
and (5.33) call directly for a forward procedure, detailed in Algorithm 5.3, which
accumulates the gradient addends starting from $t = 1$ till $t = T$, with the obvious
initial condition $J\tau_0 = 0$. If $\nabla_{v_t} E(w)$ is a function of the sole values $\tau_1, \dots, \tau_t$
(this happens for instance for $E(w) = \sum_{t=1}^{T} e(v_t)$), then we can compute these
addends in real time while updating the state vector according to the procedure
in Algorithm 5.3. Here $\nabla_{\tau_t} E(w)$ is the obvious extension of $\nabla_{v_t} E(w)$ with the
addition of some null components. Let us now introduce the backward term
$\delta_t(w)$ via the recursion relation

$$\delta_t(w) = \begin{cases} \nabla_{\tau_t} e(v_1, \dots, v_T) + \delta_{t+1}(w)\, J_{\tau_t} g(\tau_t, x_t; w, t) & \text{if } 1 \le t < T \\ \nabla_{\tau_T} e(v_1, \dots, v_T) & \text{if } t = T \end{cases} \qquad (5.34)$$

After some algebra we obtain

$$\nabla E(w) = \sum_{t=1}^{T} \delta_t(w)\, J_w g(\tau_{t-1}, x_{t-1}; w, t-1) \qquad (5.35)$$


(5.34-5.35) suggest the two-phase Algorithm 5.4 for backward computing
$\nabla E(w)$. When the network consists solely of neurons, splitting $\nabla E(w)$ into
$\nabla_a e(v_1, \dots, v_T)$ (see (5.2)) and $Ja(w)$ (it is just a matter of moving the derivative
of the activation function from one term to the other), the above procedure
comes to the usual well known error back-propagation procedure. Namely, referring
$\delta$ to the single node k, the above formulas read

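$$\frac{\partial E(w)}{\partial w_{jk}} = \delta_j \tau_k, \quad \text{with} \qquad (5.36)$$


21 $Jf(\alpha)$ is the matrix whose element of row $i$ and column $j$ is $\partial f_i(\alpha)/\partial \alpha_j$ and $\nabla f(\alpha)$ is
the vector whose $i$-th element is $\partial f(\alpha)/\partial \alpha_i$.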




Algorithm 5.3 Forward computation of the error function gradient.

vector τ=τ0;                                  // τ_t(w)
vector w=w0;                                  // w
vector grad_E=0;                              // ∇E(w)
matrix Jτ=0;                                  // Jτ_t(w)
vector F(vector τ,vector x,vector w,t);       // g(τ_t, x_t; w, t)
matrix JwF(vector τ,vector x,vector w,t);     // J_w g(τ_t, x_t; w, t)
matrix JτF(vector τ,vector x,vector w,t);     // J_{τ_t} g(τ_t, x_t; w, t)
vector grad_Eτ(t, vector τ);                  // ∇_{τ_t} e(v_1, ..., v_T)

for (t=1; t<=T; t++)
{
    Jτ=JwF(τ,x,w,t-1)+JτF(τ,x,w,t-1)*Jτ;      //update Jacobian Jτ_t(w)
    τ=F(τ,x,w,t-1);                           //update state vector τ
    grad_E+=grad_Eτ(t,τ)*Jτ;                  //update gradient ∇E(w)
}

Algorithm 5.4 Backward computation of the error function gradient.

// forward phase
vector τ[T+1];                                // τ_0(w), ..., τ_T(w)
vector w=w0;                                  // w
vector F(vector τ,vector x,vector w,t);       // g(τ_t, x_t; w, t)
τ[0]=τ0;

for (t=1; t<=T; t++)
{
    τ[t]=F(τ[t-1],x[t-1],w,t-1);              //update τ
}

// backward phase
matrix JwF(vector τ,vector x,vector w,t);     // J_w g(τ_t, x_t; w, t)
matrix JτF(vector τ,vector x,vector w,t);     // J_{τ_t} g(τ_t, x_t; w, t)
vector grad_Eτ(t, vector τ);                  // ∇_{τ_t} e(v_1, ..., v_T)
vector delta=grad_Eτ(T, τ);                   // δ_t(w)
vector grad_E=0;                              // ∇E(w)

for (t=T; t>=1; t--)
{
    grad_E+=delta*JwF(τ[t-1],x[t-1],w,t-1);   //update ∇E(w)
    delta=grad_Eτ(t-1,τ)+
        delta*JτF(τ[t-1],x[t-1],w,t-1);       //backpropagate δ_t(w)
}
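
The forward and backward schemes can be compared on a toy dynamic system. The sketch below is an illustration under explicit assumptions, none of which come from the book: the state is a single number updated as tanh(w1·τ + w2·x), the whole state is visible, and the error is half the sum of squared differences from hypothetical targets. Both procedures should return the same gradient, which is also checked against finite differences.

```python
import math

def g(tau, x, w):
    """Toy state update (assumption): one neuron with scalar state."""
    return math.tanh(w[0] * tau + w[1] * x)

def jac_tau(tau, x, w):
    """Derivative of g with respect to the previous state."""
    return (1.0 - g(tau, x, w) ** 2) * w[0]

def jac_w(tau, x, w):
    """Derivatives of g with respect to the two parameters."""
    d = 1.0 - g(tau, x, w) ** 2
    return [d * tau, d * x]

def states(w, xs, tau0):
    taus = [tau0]
    for x in xs:
        taus.append(g(taus[-1], x, w))
    return taus

def error(w, xs, ys, tau0):
    return sum(0.5 * (t - y) ** 2 for t, y in zip(states(w, xs, tau0)[1:], ys))

def forward_gradient(w, xs, ys, tau0):
    tau, jt, grad = tau0, [0.0, 0.0], [0.0, 0.0]
    for x, y in zip(xs, ys):
        a, b = jac_tau(tau, x, w), jac_w(tau, x, w)
        jt = [b[k] + a * jt[k] for k in range(2)]        # propagate the state Jacobian
        tau = g(tau, x, w)                               # update the state
        grad = [grad[k] + (tau - y) * jt[k] for k in range(2)]
    return grad

def backward_gradient(w, xs, ys, tau0):
    taus = states(w, xs, tau0)                           # forward phase
    T = len(xs)
    delta, grad = taus[T] - ys[T - 1], [0.0, 0.0]
    for t in range(T, 0, -1):                            # backward phase
        b = jac_w(taus[t - 1], xs[t - 1], w)
        grad = [grad[k] + delta * b[k] for k in range(2)]
        if t > 1:
            delta = (taus[t - 1] - ys[t - 2]) + delta * jac_tau(taus[t - 1], xs[t - 1], w)
    return grad

w, tau0 = [0.5, -0.3], 0.2
xs, ys = [1.0, 0.5, -1.0, 0.2], [0.1, -0.2, 0.3, 0.0]
grad_fwd = forward_gradient(w, xs, ys, tau0)
grad_bwd = backward_gradient(w, xs, ys, tau0)
print(grad_fwd, grad_bwd)
```

The forward scheme carries a Jacobian along with the state and needs no memory of the past, while the backward scheme stores the whole trajectory but propagates only a vector of the size of the state.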




[Figure 5.14 diagram. Panels: Forward, Backward. Labels on the backward panel: output layer $\delta_i = (h(x)_i - \tau_i^o)\,\sigma'(a_i)$, $\Delta w_{ij} = -\eta\,\delta_i \tau_j$; hidden layer $\delta_j = \sigma'(a_j) \sum_h \delta_h w_{hj}$, $\Delta w_{jk} = -\eta\,\delta_j \tau_k$.]


Fig. 5.14: Forward and backward phase of back-propagation algorithm, where
$E = \sum_i (h(x)_i - \tau_i^o)^2$ and $\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}$.


$$\delta_j = \begin{cases} \dfrac{\partial E}{\partial \tau_j}\, \sigma'(a_j) & \text{if } j \text{ denotes an output neuron} \\ \sigma'(a_j) \sum_h \delta_h w_{hj} & \text{if } j \text{ denotes a hidden neuron} \end{cases}$$

where $\ell$ has the topological meaning of a layer index if the nodes are arranged in
layers according to the multilayer perceptron layout; nodes of a same layer have
feed-forward connections with all and only nodes of the subsequent one; $x$ represents
the input sequence that the network processes to produce the visible states $h(x)$
along its evolution. This allows the figuration of the algorithm as in Fig. 5.14
in the form of a round-way trip as remarked by the layered indicization of the
parameters: one way propagating the signal from input to output, the other
way propagating the error from output to input.
There are many ways to use the gradient. The most elementary is to make 
a step in the parameters space exactly in the direction of the E(w) steepest 
descent. This corresponds to updating each parameter according to 


$$w_{ij} = w_{ij} - \eta \frac{\partial E(w)}{\partial w_{ij}} \qquad (5.37)$$


It represents an algorithmic counterpart of a local independence assumption of
the single weight influences. The step width $\eta$ is denoted learning rate.
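
The back-propagation of the error followed by a steepest descent step can be sketched on a 2-2-1 sigmoidal network. This is a minimal sketch under assumptions not stated in the book: a single example, the error taken as half the squared difference between output and target, an arbitrary starting point, and hypothetical names such as `backprop_gradient`.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, x):
    """w = [w31, w32, g3, w41, w42, g4, w53, w54, g5] for a 2-2-1 network."""
    t3 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
    t4 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
    t5 = sigmoid(w[6] * t3 + w[7] * t4 + w[8])
    return t3, t4, t5

def error(w, x, target):
    return 0.5 * (forward(w, x)[2] - target) ** 2

def backprop_gradient(w, x, target):
    t3, t4, t5 = forward(w, x)                 # forward phase: propagate the signal
    d5 = (t5 - target) * t5 * (1.0 - t5)       # output node: error times sigmoid derivative
    d3 = t3 * (1.0 - t3) * d5 * w[6]           # hidden nodes: derivative times the
    d4 = t4 * (1.0 - t4) * d5 * w[7]           # delta coming back from the output
    # each derivative is the delta of the receiving node times the state of the sender
    # (the sender of a threshold is a dummy input equal to 1)
    return [d3 * x[0], d3 * x[1], d3, d4 * x[0], d4 * x[1], d4, d5 * t3, d5 * t4, d5]

w = [0.3, -0.2, 0.1, 0.5, 0.4, -0.3, 0.7, -0.6, 0.2]    # arbitrary starting point
x, target, eta = (1.0, 0.0), 1.0, 0.5
grad = backprop_gradient(w, x, target)
w_new = [wi - eta * gi for wi, gi in zip(w, grad)]       # steepest descent step
print(error(w, x, target), error(w_new, x, target))
```

Every parameter is moved independently by its own partial derivative scaled by the learning rate, which is the local independence assumption mentioned above.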




Example 5.6. Learning the XOR function, a task already present in Example
5.5, is the paradigmatic example of a highly complex function learnable through
a neural network. The main idea is to drive the parameter vector $w$ by a fixed amount in
the parameter space along the direction $-\nabla_w s$ of the gradient descent of the
pivot $s$ (5.1), i.e. the direction identified by a 9-component vector for a network
as in Example 5.5, where each component is proportional to the opposite
of the derivative of $s$ w.r.t. the corresponding component of $w$. We will discuss
benefits and drawbacks of this strategy in the next section. The greatest
drawback however is the possibility of drawing $w$ into a local minimum of $s$.
To improve the training's efficiency we tried the following strategy. We pursue
the qualitative idea that a candidate transition $w \to w + \delta w$ be accepted with
high probability if $\delta w$ forms a small angle $\alpha$ with $-\nabla_w s$, with small but non
vanishing probability otherwise. Moreover, the need to damp high frequency
oscillations around $-\nabla_w s$, which would be an annoying artifact of annealing,
suggests in fact the viability of moving along $-\nabla_w s$ itself when the proposed
step forms a small enough angle w.r.t. $-\nabla_w s$. More specifically, having drawn
with uniform probability a unit vector $d$ in the parameter space, we perform
a step of length $\eta$ in the direction $-\nabla_w s$ if $|\alpha|$ is below a fixed threshold $q$;
otherwise we perform a step of length $\eta$ in the direction $d$ with probability
$\exp(-\beta/2\,|\alpha|)$. The resulting procedure is given in Algorithm 5.5.


Algorithm 5.5 Pseudocode of the DAMPING procedure.


Procedure: DAMPING
Begin
    get an initial vector $W_0$;
    Repeat
        Compute the steepest descent direction $\nabla W_0$;
        Choose randomly a neighbour direction $d$;
        Let $\alpha = d \cdot \nabla W_0 / \|\nabla W_0\|$;
        if $|\alpha| < q$ $W_0 = W_0 - \eta \nabla W_0$;
        else let $W_0 = W_0 - \eta d$ with probability $\exp(-\beta/2\,|\alpha|)$;
    until time is expired
    output $W_0$;
end


We rule the algorithm in the frame of a simulated annealing procedure, so that its
parameters are $\eta$, $q$, $\beta$ and the coefficient $c$ of the cooling schedule
$\beta = \beta_0 \exp(c\nu)$, where $\nu$ is the number of the current learning cycle;
typically $q = \pi/4$. In Fig. 5.15 we show, for the risk function (5.15), the faster
learning achieved by the DAMPING procedure with respect to the simple
back-propagation algorithm.
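
A possible reading of the DAMPING procedure is sketched below. Several choices are assumptions made only for illustration: the parameter space is two-dimensional so that a uniform unit direction is easy to draw, the angle is measured between the random direction and the opposite of the gradient as in the verbal description, random steps are taken along d itself, the descent step has fixed length, and the cost is a toy quadratic. The default values of the parameters are hypothetical as well.

```python
import math
import random

def damping(grad, w0, eta=0.05, q=math.pi / 4, beta0=5.0, c=0.002, cycles=2000, seed=0):
    """Descent along -grad when the random direction is close to it, random shake otherwise."""
    rng = random.Random(seed)
    w = list(w0)
    for v in range(cycles):
        beta = beta0 * math.exp(c * v)                  # cooling schedule
        gvec = grad(w)
        norm = math.hypot(gvec[0], gvec[1])
        if norm == 0.0:
            break                                       # stationary point reached
        descent = [-gk / norm for gk in gvec]           # unit vector along -grad
        theta = rng.uniform(0.0, 2.0 * math.pi)
        d = [math.cos(theta), math.sin(theta)]          # uniformly drawn unit direction
        cos_a = max(-1.0, min(1.0, d[0] * descent[0] + d[1] * descent[1]))
        alpha = math.acos(cos_a)                        # angle between d and -grad
        if alpha < q:
            w = [wk + eta * dk for wk, dk in zip(w, descent)]   # move along -grad itself
        elif rng.random() < math.exp(-beta / 2.0 * alpha):
            w = [wk + eta * dk for wk, dk in zip(w, d)]         # occasional random shake
    return w

# Toy quadratic cost with its minimum in the origin.
def cost(w):
    return w[0] ** 2 + w[1] ** 2

w_final = damping(lambda w: [2.0 * w[0], 2.0 * w[1]], [2.0, 2.0])
print(w_final, cost(w_final))
```

As the schedule raises the parameter of the acceptance probability, the random shakes die out and the procedure reduces to a plain descent with fixed step length.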


Example 5.7. We may subsymbolically learn the circle class in Example 3.4 
simply by training a neural network to answer 1 when a point x is in the circle, 


Learning XOR 
again 


plus a Metropolis 
inspired drift 


Learning a circle 


is easy enough. 




Fig. 5.15: Comparison between damped (plain line) and standard (dashed line) back- 
propagation algorithms for learning a XOR. 


0 otherwise. Of course, now we cannot guarantee that the function $h$ computing
this label as an approximation of the goal circle is a circle too (see Fig. 5.16).
We only expect that the probability of meeting in the future a point $x$ such that
$c(x) \neq h(x)$ is very low, with good confidence. Figure 5.17 shows the behavior of
the relevant parameters of this learning experiment when a 2-3-1 three-layer network,
with 2 inputs, 3 intermediate neurons and one output, is used to learn the circle
of center $(0,0)$ and radius 2. A training set of 200 elements uniformly drawn
from $[-3,3]^2$ is given in input to a back-propagation algorithm run for 20000
iterations with learning rate 0.01. As a further expedient, we found it useful to
give target output 0.9 to the positive examples and 0.1 to the negative examples,
while $h(x)$ is computed by thresholding the state of the output neuron. The
figure reports the results we obtained using the JNNS simulator [Fischer, 2003]:
the black curve reports exactly the pivot $s$ of the twisting argument
(5.1). At the least, we want to be able to reduce its value. Then we test the
effectiveness of this statistic by checking $s$ as an estimate of $\hat{R}$, the MSE on a
test set of 100 elements (the gray curve), i.e. we check the labels produced by
the network on a new set of points not used for training it. In the figure we
see that in this learning instance we are working with an essentially consistent
estimator: the longer the training phase, the smaller the MSE evaluated on
the test set, the relative position of the two curves being an accident of
the sampled points. Finally, as an inner indicator of the training evolution
we consider the sum $\Delta w$ of the squared changes in the connection weights. We
see a substantial coherence between the courses of the three curves: $s$, $\hat{R}$ and
$\Delta w$. With the hypothesis computed at the end of the whole training phase we
labeled all points of the rastered $[-3,3]^2$ square in Fig. 5.16, obtaining a
separator curve very close to the original circle.
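
The experiment can be mimicked with a small self-contained sketch. It is not
the JNNS setup used above: the on-line update order, the sigmoid units, the
weight initialization in (-0.5, 0.5) and all function names are assumptions
made only for illustration, and the MSE is computed on the continuous output
against the 0.9/0.1 targets.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def make_set(n, rng, radius=2.0, side=3.0):
    """n points uniform in [-side, side]^2; target 0.9 inside the circle, 0.1 outside."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-side, side), rng.uniform(-side, side))
        inside = x[0] ** 2 + x[1] ** 2 <= radius ** 2
        data.append((x, 0.9 if inside else 0.1))
    return data

def forward(net, x):
    """Return the 3 hidden states and the output state of a 2-3-1 network."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in net["hid"]]
    out = sigmoid(sum(v * h for v, h in zip(net["out"], hidden)) + net["out"][3])
    return hidden, out

def mse(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)

def classify(net, x):
    """h(x): threshold the output neuron at 0.5."""
    return 1 if forward(net, x)[1] > 0.5 else 0

def train(train_set, test_set, epochs, lr, rng):
    """On-line back-propagation; returns the net and (train MSE, test MSE) per epoch."""
    net = {"hid": [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)],
           "out": [rng.uniform(-0.5, 0.5) for _ in range(4)]}   # last entry: bias
    curves = []
    for _ in range(epochs):
        for x, t in train_set:
            hidden, out = forward(net, x)
            delta_out = (out - t) * out * (1.0 - out)
            for j in range(3):
                # hidden delta uses the output weight before it is updated
                delta_h = delta_out * net["out"][j] * hidden[j] * (1.0 - hidden[j])
                net["out"][j] -= lr * delta_out * hidden[j]
                net["hid"][j][0] -= lr * delta_h * x[0]
                net["hid"][j][1] -= lr * delta_h * x[1]
                net["hid"][j][2] -= lr * delta_h
            net["out"][3] -= lr * delta_out
        curves.append((mse(net, train_set), mse(net, test_set)))
    return net, curves
```

Calling `train(make_set(200, rng), make_set(100, rng), epochs, lr, rng)` with a
seeded `random.Random` gives the two error curves to be compared, in the spirit
of Fig. 5.17(a).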




Fig. 5.16: Recovering a circle with the back-propagation algorithm. Black circle: 
learning target; gray and black region denote the points classified as positive and 
negative, respectively, by the neural network. 




Fig. 5.17: (a) Graph of the errors versus the number of iterations $\nu$ in the learning of
Fig. 5.16. Black and gray curves denote the mean square error on the training and test
sets, respectively. (b) Graph of the sum of squared changes in connection weights versus
iterations.




Fig. 5.18: Motion of the flexible arm. $(x,y)$: fixed Cartesian frame; $r$: abscissa of a
frame rotating with the arm; $m$: payload mass; $t$: time; $\vartheta$: rotation angle;
$w$: transverse displacement.


Example 5.8. A more complex example, rooted in an actual operational framework,
is represented by the classical control problem of ruling the motion of a
flexible arm as schematized in Fig. 5.18. The goal is that at the end of its stroke
the arm brings a payload exactly to a position rotated by an angle $\vartheta$ with respect to
the initial one, without vibration, so that the payload can be taken over by another
device. The target must be reached whatever the assigned rotation angle, and
whatever the mass the arm finds at its tip.

Though linear, this control is made difficult by the vibrations induced by the 
rotation of the hub, possibly with a high speed. Exploiting the knowledge about 
i) the linear and adaptive control theory, ii) the dynamics of a flexible body, 
and iii) the training of a hybrid system, we may solve the problem through the 
architecture shown in Fig. 5.19. Details of the architecture and the algorithm 
may be found in [Apolloni et al., 2001b]. Here we simply point out some main 
features: 


• the whole architecture may be seen as a hybrid system to be trained
globally;


• the subsymbolic part consists of: i) the NN control module, deputed
to compute the control set point; and ii) the NN mass identification module,
deputed to identify the payload mass;


• the symbolic part consists of: i) the reference model module, which
indicates an optimal behavior of the arm tip vibration to get zero bending
at the stroke end (see curve (4) in Fig. 5.20); and ii) the symbolic plant
model module, deputed to simulate the plant dynamics;


• the Plant box identifies the physical device;


• the general philosophy is the following. The symbolic modules do not need
training as they are designed according to results of System Theory. In
the forward phase a control signal is generated by the NN control module




Fig. 5.19: Block diagram of the neuro-control system. 


Fig. 5.20: Performance of the hybrid controller trained on a single task: mass (to be
identified) = 0.156 kg, stop angle = $\pi/6$ rad, arm length $L = 1.05$ m. x-axis: time
(in seconds); y-axis: (1) motor acceleration w(t); (2) power consumption; (3) motor
velocity w(t); (4) tip vibration w(L,t); (5) tip angle z(t); (6) motor angle W(t); (7)
payload mass estimate. The vertical scales are linear but not uniform between the
graphs. Curves (5) and (6) have the same scale.




Fig. 5.21: The operational range of the control system. Black and gray bullets distin- 
guish the set of examples used, respectively, for training and testing the system. 


on the basis of the signals coming from the other modules and the true
plant, and is fed as input to the latter to operate it. In the backward
phase the difference between the actual plant trajectory and a reference vibration
trajectory is piped back through the plant model and the neural modules to
train the latter. Inside the plant we assume a linear PD (proportional
derivative) controller mounted so that the control signal generated by the
NN control module is exactly the PD set point, which is changed over
time so that the plant follows the reference trajectory;


• the plant model is described by a series of linear difference equations that
satisfy the differentiability requirements of the symbolic PE recursive function;


• the learning algorithm is based on the backward gradient computation
described in Algorithm 5.4, where only the parameters of the neural modules
are updated.


The performance of the set up neurocontrol system is exemplified by the tra- 
jectories in Fig. 5.20. The generalization capability of the system spans a range 
of operational parameters (payload mass and rotation angle) around ten times 
larger than the ranges explored within the training set (see Fig. 5.21). 




5.4.2.3 Gradient descent in a stochastic framework 


When the output of a PE is produced through a sampling mechanism such as
(2.1), where $p$ is given by the probability (5.5), minimizing w.r.t. $w$ a function $R$
depending on a set of these outputs translates again into a twisting argument.
We are searching for a modification of $w$, as a free parameter of the explaining
function, which diminishes $R$ for whatever underlying uniform seed $u$. We solve
this problem approximately by referring to a twisted function $r$, for instance
$r = \mathrm{E}[R]$,²² hoping that the weight updating rule we find really works on
$R$ as well. This happens for $I$ as in (5.19).

Since the distribution law exhibited by the neurons of the visible belt in Fig.
5.5 depends on the neuron activation schedule, we assume a parallel activation
mode in analogy to the deterministic network discussed in Sec. 5.4.2.2. Namely,
given the configuration $\tau$ of the network at time $t$, each neuron updates
simultaneously with, yet independently of, all the others, according to the individual
stochastic transition rule (5.5).

Thus the configuration $\tau$ of the network performs a random walk with $t$ on
$\mathfrak{X} = \{0,1\}^{\nu}$. We describe this process by considering a random variable $T(t)$
that depends on $t$ and detailing the transition probability from $\tau$ to $\tau'$:

$$p_{\tau\tau'} = P(T(t+1) = \tau' \mid T(t) = \tau) = \prod_{i=1}^{\nu} P(T_i(t+1) = \tau'_i \mid T(t) = \tau) = \prod_{i=1}^{\nu} \frac{1}{1 + e^{\beta(1-2\tau'_i)a_i(\tau)}} \qquad (5.38)$$

This ensures the convergence to the asymptotic distribution law

$$\pi_\beta(\tau) = \frac{\sum_{\tau'} e^{-\beta \mathcal{L}(\tau,\tau')}}{\sum_{\tau,\tau'} e^{-\beta \mathcal{L}(\tau,\tau')}} \qquad (5.39)$$

with

$$\mathcal{L}(\tau,\tau') = -\tau' \cdot w(\tau) - \theta \cdot (\tau + \tau') \qquad (5.40)$$
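
As an illustration of the parallel activation mode, the following sketch draws
one synchronous update and evaluates the product form of the transition
probability (5.38). It assumes, for simplicity, that the net input is
a_i(tau) = sum_j w_ij * tau_j with no threshold term and that rule (5.5) gives
P(T_i = 1) = 1/(1 + exp(-beta * a_i)); the function names are hypothetical.

```python
import math
import random

def net_input(tau, w, i):
    """a_i(tau) under the no-threshold assumption stated above."""
    return sum(w[i][j] * tau[j] for j in range(len(tau)))

def parallel_step(tau, w, beta, rng):
    """All neurons update at once, each one looking only at the old configuration."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-beta * net_input(tau, w, i)))
            else 0
            for i in range(len(tau))]

def transition_prob(tau, tau_next, w, beta):
    """Product over the neurons of 1 / (1 + exp(beta * (1 - 2*tau'_i) * a_i(tau)))."""
    p = 1.0
    for i in range(len(tau)):
        p *= 1.0 / (1.0 + math.exp(beta * (1 - 2 * tau_next[i]) * net_input(tau, w, i)))
    return p
```

Since each factor and its complement sum to 1, the probabilities of all the
2^nu successor configurations of a given tau add up to 1, which is a quick
sanity check on the reconstruction of the product.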


We consider the derivative of the relative entropy $I$ directly w.r.t. the plasticity
parameters:

$$\frac{\partial I}{\partial w_{ij}} = \beta\left[\left(\mathrm{E}_{\pi_\beta}\big(T_i(t)T_j(t+1)\big) - \mathrm{E}_{\gamma}\big(T_i(t)T_j(t+1)\big)\right) + i \leftrightarrow j\right] \qquad (5.41)$$

where the term $i \leftrightarrow j$ denotes a new addend in the expression in square brackets,
obtained from the first addend by swapping subscripts $i$ and $j$.
Here the expectations refer to the asymptotic distribution laws
induced by the transition probabilities (5.38). Thus they are independent of $t$, and
the probability measure appearing in the subscript comes from the marginalization
of this distribution to the visible nodes at time $t$. In other words, we
consider two evolution modes of $T$. In the unclamped mode the whole state is

²² Note that $r$ is quite different from the risk $R$; the former is a fixed value, the latter a
random variable.




left free to evolve starting from a random distribution law; after quite a while
its distribution law nears (5.39). In the clamped mode the situation is the same,
except that the states of the visible nodes are kept fixed to a vector of
values supplied by the environment distribution law $\gamma$. $\mathrm{E}_{\pi_\beta}$ denotes the
expected value according to the first distribution. $\mathrm{E}_{\gamma}$ is a bit more complex. It
denotes the expected value w.r.t. a distribution having $\gamma$ ruling the visible nodes,
and an equilibrium distribution generated by the transition probabilities (5.5)
ruling the remaining nodes. We may imagine the latter as the result of iteratively
clamping the visible nodes to values supplied by $\gamma$ and waiting to reach
stationarity with the dynamics (5.38). This is a rather cumbersome procedure
that we will simplify at the end of this section.

Then we assume that a move toward the descent of $I$, as in (5.19), is obtained
along the direction identified by (5.41) with the population expectations
$\mathrm{E}[T_i(t)T_j(t+1)]$ substituted with the sample means
$\frac{1}{m}\sum_{k} T_i(t_k)T_j(t_k+1)$ on the same
variables. To estimate the expected values in (5.41) we launch the network in
its stochastic evolution. Once we assume that the process is close to equilibrium
we start computing our statistics. Namely, we estimate the expected values $\mathrm{E}_{\pi_\beta}$
and $\mathrm{E}_{\gamma}$ through sample means, where the sampled values of $T_i(t+1)T_j(t)$ are
collected for $\mathrm{E}_{\pi_\beta}$ at each state update in the unclamped mode. Instead, the
values for $\mathrm{E}_{\gamma}$ are collected only when the environmental state vector matches
the sampled environment variable (i.e. after having clamped the visible nodes
with these variables and reached the equilibrium conditions).
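
A minimal sketch of the unclamped statistics collection follows. It only shows
how lagged products T_i(t) * T_j(t+1) can be averaged along a trajectory after
a burn-in; the burn-in length, the no-threshold net input and the names
`parallel_step` and `lagged_means` are assumptions of this illustration, and
the clamped phase is not covered.

```python
import math
import random

def parallel_step(tau, w, beta, rng):
    """One synchronous stochastic update (net input without threshold, by assumption)."""
    n = len(tau)
    nxt = []
    for i in range(n):
        a_i = sum(w[i][j] * tau[j] for j in range(n))
        p_one = 1.0 / (1.0 + math.exp(-beta * a_i))
        nxt.append(1 if rng.random() < p_one else 0)
    return nxt

def lagged_means(w, beta, steps, burn_in, seed=0):
    """Sample means of T_i(t) * T_j(t+1), collected after discarding burn_in steps."""
    rng = random.Random(seed)
    n = len(w)
    tau = [rng.randint(0, 1) for _ in range(n)]
    for _ in range(burn_in):                 # let the chain approach equilibrium
        tau = parallel_step(tau, w, beta, rng)
    sums = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        nxt = parallel_step(tau, w, beta, rng)
        for i in range(n):
            for j in range(n):
                sums[i][j] += tau[i] * nxt[j]
        tau = nxt
    return [[s / steps for s in row] for row in sums]
```

With all weights set to 0 every neuron is a fair coin, so each estimated
moment should be close to 1/4; this is a convenient check that the estimator
behaves as expected before using it inside a learning loop.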

The distinctive way our learning algorithm estimates the above derivatives
and implements the descent along the parameter surface derives from the
following considerations:


• the dynamics of the stochastic BM converges toward the minimum of
$\mathcal{L}(h(\tau),\tau)$, with $h(\tau) = \sigma(a(\tau))$ and $\sigma$ as in (5.4), i.e. the deterministic
update of $\tau$. It coincides with the minimum of $\mathcal{L}(\tau,\tau)$ (a function
perfectly determined by the network connection weights), apart from chimeric
minima constituted by loops between two points (limit cycles of order 2
[Aarts and Korst, 1989]);


• in turn,

$$\frac{\partial I}{\partial w_{ij}} = \beta\left[\mathrm{E}_{\gamma}\!\left(\frac{\partial \mathcal{L}(T(t),T(t+1))}{\partial w_{ij}}\right) - \mathrm{E}_{\pi_\beta}\!\left(\frac{\partial \mathcal{L}(T(t),T(t+1))}{\partial w_{ij}}\right)\right] \qquad (5.42)$$


Thus we can read the dynamics of the learning algorithm as follows: 


• the neuron states walk around the minima of a quadratic function (apart
from limit cycles);


• the network parameters walk toward the minimum of a relative entropy;




• for a sufficiently large $\beta$, a fixed point in the network dynamics is
constituted by a pair $(w, \tau)$ where the difference between the minimum of
$\mathcal{L}(\tau',\tau)$ and the value of this function marginally averaged w.r.t. the
distribution law $\gamma$ is minimal;


• in the absence of the environmental terms (those averaged on $\gamma$) in (5.42), the
descent according to the rest of the derivative should induce the network
to become an aggregate of totally independent neurons flipping with equal
probability between states 0 and 1. This sprightliness of the neurons is
moderated by the contribution of the former terms, and is geared to reproducing,
at least in part, the structure of the environmental distribution.


Thus the initial thermicity of the process is partly preserved in the thermal 
core of the network in Fig. 5.5, which is never reached by external stimuli. 
Instead, intermediate belt neurons are devoted to give structure to the visible 
distribution law. This creates the inverse of the cumulative distribution function 
of the random variable we want to generate, i.e. the explaining function of the 
sampling mechanism we want to implement. 

The borderline between the two belts moves with the learning process. The 
more structure there is in the mapping from thermal to environmental states, 
the greater the number of neurons converting from thermal to intermediate: 
this produces the twofold effect of enriching the distribution law of the visible 
nodes’ state and lowering the entropy of the thermal belt. 

However, quite unlike the case of uniform distribution, we might expect that
for a given neuron inside the intermediate belt only a blob of surrounding units
proves effective on its dynamical trend, while the rest of the neurons act as thermal
units as well.

Are $\frac{1}{m}\sum_{k} T_i(t_k)T_j(t_k+1)$ sufficient statistics? The answer is partially
positive. Let us generalize the notion of Boltzmann machine as follows. First of all,
in order to use elegant results from harmonic analysis we shift our configuration
space from $\mathfrak{X} = \{0,1\}^{\nu}$ to $Y = \{-1,1\}^{\nu}$. Thus $s = (s_1, \ldots, s_\nu) \in Y$ now denotes
the network configuration.

We call $\Lambda_\nu = \{1,2,\ldots,\nu\}$ and each subset $M \subseteq \Lambda_\nu$ a block. To each block
$M$ we associate a block variable $\sigma_M$ defined by

$$\sigma_M = \prod_{i \in M} s_i \ \text{ if } M \text{ is non empty}, \quad 1 \text{ otherwise} \qquad (5.43)$$

Elementary considerations of harmonic analysis on the two-element
multiplicative group $\{-1,1\}$ show that any probability function $P$ on $Y$ can be
written in the form

$$P(s) = \frac{1}{2^{\nu}} \sum_{M \subseteq \Lambda_\nu} p_M \sigma_M \qquad (5.44)$$

where $p_M = \mathrm{E}(S_M)$ and $S_M$ denotes the random variable whose specifications
are $\sigma_M$. The above considerations show that the set of moments $\{p_M, M \neq \emptyset\}$




determines a coordinate system on the $(2^{\nu} - 1)$-dimensional manifold $PM(Y)$ of
all probability measures on $Y$.
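
The expansion (5.44) is easy to check numerically on a small configuration
space. The sketch below computes all the block moments of a probability
function given as a dictionary and rebuilds the function from them; the data
layout and the helper names are choices of this illustration, and blocks are
indexed from 0 rather than from 1.

```python
from itertools import combinations, product
from math import prod

def blocks(nu):
    """All subsets M of {0, ..., nu-1}, the empty block included."""
    return [M for k in range(nu + 1) for M in combinations(range(nu), k)]

def sigma(M, s):
    """Block variable: product of s_i over i in M (1 for the empty block)."""
    return prod(s[i] for i in M)

def moments(P, nu):
    """p_M = E[sigma_M(S)] for every block M, with P a dict {configuration: probability}."""
    return {M: sum(p * sigma(M, s) for s, p in P.items()) for M in blocks(nu)}

def reconstruct(p, nu):
    """Right-hand side of (5.44): 2^(-nu) * sum_M p_M * sigma_M(s), for every s."""
    return {s: sum(pM * sigma(M, s) for M, pM in p.items()) / 2 ** nu
            for s in product((-1, 1), repeat=nu)}
```

The moment of the empty block is always 1, so the remaining 2^nu - 1 moments
are the free coordinates mentioned in the text.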

Here, the set of sample moments $\frac{1}{m}\sum_{k} \sigma_M(s_k)$ constitutes a set of joint
sufficient statistics. But we are working only with second order moments. However,
consider a generalized version of the Boltzmann machine defined by the following
dynamics:

$$P(S_i(t+1) = \pm 1 \mid S(t) = s) = \frac{1}{1 + e^{\mp 2 \sum_{M \subseteq \Lambda_\nu,\, i \in M} \theta_M \sigma_{M-\{i\}}}} \qquad (5.45)$$

for suitable $\theta_M$, where the equality reads twice, once selecting the upper signs,
then the lower ones.

Its stationary distribution is $P$ as in (5.44). The incremental training
strategy based on the cost function (5.18) naturally extends as follows.

Let $P_0$ and $P_1$ be two probability functions. It is trivial to compute that for
each nonempty block $M \subseteq \Lambda_\nu$,

$$\frac{\partial}{\partial \theta_M} \mathrm{E}_{P_0}\left[\ln \frac{P_1(S)}{P_0(S)}\right] = p_M(0) - p_M(1)$$

where 0 or 1 in the argument of $p_M$ indicates whether the corresponding
quantity is computed using $P_0$ or $P_1$.

Some less elementary considerations [Apolloni et al., 1999] allow us to state
that any move in the parameter space, even along a single coordinate, by a
quantity $\eta(p_M(0) - p_M(1))$ pushes $P_1$ closer to $P_0$ for $\eta > 0$ and suitably small.
We hope to do the same operation with sample moments, in particular with those
of order at most 2. Moreover, to strengthen this hope we enrich the
network with the intermediate belt nodes, which allow coding dependencies more
complex than the second order ones.
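
The effect of a single-coordinate move can be checked on a toy case. The
sketch assumes that the machine distribution has the exponential form
P1(s) proportional to exp(sum_M theta_M * sigma_M(s)), which is an assumption
of this illustration rather than a statement taken from the text, and measures
closeness by the relative entropy of P0 w.r.t. P1; all names are hypothetical.

```python
import math
from itertools import product

def sigma(M, s):
    """Block variable: product of s_i over i in M."""
    return math.prod(s[i] for i in M)

def gibbs_family(theta, nu):
    """P1(s) proportional to exp(sum_M theta_M * sigma_M(s)); theta maps blocks to reals."""
    states = list(product((-1, 1), repeat=nu))
    weights = [math.exp(sum(t * sigma(M, s) for M, t in theta.items())) for s in states]
    z = sum(weights)
    return {s: x / z for s, x in zip(states, weights)}

def moment(P, M):
    return sum(p * sigma(M, s) for s, p in P.items())

def relative_entropy(P0, P1):
    return sum(p * math.log(p / P1[s]) for s, p in P0.items() if p > 0)

def moment_step(P0, theta, M, eta, nu):
    """Move only theta_M, by eta * (p_M(0) - p_M(1))."""
    P1 = gibbs_family(theta, nu)
    new_theta = dict(theta)
    new_theta[M] = theta.get(M, 0.0) + eta * (moment(P0, M) - moment(P1, M))
    return new_theta

# toy target with a positive correlation between the two units
P0 = {(-1, -1): 0.4, (-1, 1): 0.1, (1, -1): 0.1, (1, 1): 0.4}
before = relative_entropy(P0, gibbs_family({}, 2))
theta = moment_step(P0, {}, (0, 1), eta=0.1, nu=2)
after = relative_entropy(P0, gibbs_family(theta, 2))
```

Starting from the uniform distribution, the step raises the parameter of the
pair block and lowers the relative entropy, as the statement above suggests
for small positive eta.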

If we rule a Markov chain via a sequential updating of the Boltzmann ma- 
chine, (5.39) and (5.40) simplify as follows 


(5.46) 


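$$\pi_\beta(\tau) = \frac{e^{-\beta L(\tau)}}{\sum_{\tau'} e^{-\beta L(\tau')}} \qquad (5.47)$$

with

$$L(\tau) = -\tau \cdot w(\tau) - 2\theta \cdot \tau \qquad (5.48)$$

(actually, $L(\tau) = \mathcal{L}(\tau,\tau)$), and

$$\frac{\partial I}{\partial w_{ij}} = \beta\left[\mathrm{E}_{\pi_\beta}(T_i T_j) - \mathrm{E}_{\gamma}(T_i T_j)\right] \qquad (5.49)$$


This reads as a simplified version of (5.41), where the state time dependence
disappears since $T_i$ and $T_j$ refer to the same time clock, and the reverse term
disappears for the same reason. The identification of the two learning phases,
clamped and unclamped, persists, but the collection of statistics around the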



Fig. 5.22: A 2-2-1 stochastic multilayer perceptron: $\tau_i \in \{0,1\}$; the double arrows
indicate the symmetry of the coupling constants $w_{ij} = w_{ji}$.


former now is easier. All we must do is to bring the Markov chain at a certain
time $t_0$ to the equilibrium distribution with visible nodes clamped to some target
state assignment; then we observe the machine sequentially to collect the statistics
$\frac{1}{m}\sum_{k} \tau_i(k)\tau_j(k)$. Still, this procedure takes too long to reach equilibrium
conditions. We may wait hours even to get a stationary distribution of the states of
a 100-node Boltzmann machine. This is one reason why this learning paradigm
is commonly used for studying natural phenomena but generally avoided for
computing solutions of numerical problems.
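
On a machine small enough to enumerate all configurations, the equilibrium
law of the sequential dynamics can be checked exactly instead of waiting for a
simulation to settle. The sketch below builds the law proportional to
exp(-beta * L(tau)) and pushes it through one sequential update of a single
neuron. The update probability is taken, by assumption, as the conditional law
implied by that distribution; `theta` stands for a hypothetical threshold
vector, and all names are illustrative.

```python
import math
from itertools import product

def L(tau, w, theta):
    """Quadratic function -tau.w.tau - 2*theta.tau of a 0/1 configuration."""
    n = len(tau)
    quad = sum(tau[i] * w[i][j] * tau[j] for i in range(n) for j in range(n))
    return -quad - 2.0 * sum(theta[i] * tau[i] for i in range(n))

def gibbs(w, theta, beta):
    """Distribution proportional to exp(-beta * L(tau)) over all configurations."""
    states = list(product((0, 1), repeat=len(theta)))
    weights = [math.exp(-beta * L(s, w, theta)) for s in states]
    z = sum(weights)
    return {s: x / z for s, x in zip(states, weights)}

def update_neuron(dist, i, w, theta, beta):
    """Push a distribution over configurations through one update of neuron i."""
    new = {s: 0.0 for s in dist}
    for s, p in dist.items():
        s1 = s[:i] + (1,) + s[i + 1:]
        s0 = s[:i] + (0,) + s[i + 1:]
        # probability of switching neuron i on, given the other neurons
        p_on = 1.0 / (1.0 + math.exp(-beta * (L(s0, w, theta) - L(s1, w, theta))))
        new[s1] += p * p_on
        new[s0] += p * (1.0 - p_on)
    return new
```

If the distribution returned by `gibbs` is left unchanged by `update_neuron`
for every neuron, it is stationary for any sequential activation order.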


Example 5.9. Consider again the 2-2-1 architecture employed for learning an
XOR via back-propagation. Now allow the connections to be two-way, with the
sole constraint of attributing the same weight to each direction (namely $w_{ij} = w_{ji}$,
see Fig. 5.22). Then activate the neurons sequentially with the probabilistic rule
(5.5), so that the network constitutes a sequential Boltzmann machine. Then train the
network using the gradient (5.49) estimated through sample means as described
previously. For fixed $\beta = 5$ and $\eta = 0.001$, Fig. 5.23(a) (from [Apolloni et al.,
1992]) shows the course of the percentage of success with the training cycles. A
success is recorded when a correct pair $((\tau_1, \tau_2), \tau_0)$ is exhibited by the machine
in the unclamped phase. A training cycle corresponds to the exhibition of a
randomly selected correct pair and the corresponding computations; we compute
the percentages every ten cycles after having increased $\beta$ to 100 to reduce the
variance of the output. Percentage 0.5 denotes a machine that essentially tosses
a coin for associating $\tau_0$ to $(\tau_1, \tau_2)$; with 0.7 the machine loses one of the four
examples. Due to the randomness of the network, even in the case of complete
success the percentage is a bit lower than 1. The difference between the two
figures comes from a different initialization of the weights: we obtain a better
performance when no prejudice is involved in this operation, i.e. when all weights
are initially set to 0 (see Fig. 5.23(b)). Finally, Fig. 5.24 (from [Apolloni et al.,
1992]) reports an analogous behavior when the network is still an MLP with




Fig. 5.23: Training the symmetric MLP in Fig. 5.22 to learn the XOR function with
sequential learning rule; (a) initial $w_{ij}$ uniform in $(-9,9)$; (b) initial $w_{ij} = 0$. X
axis: steps, Y axis: percentage of success, curves: different evolutions of the training
process.


oriented connections (as usual) yet with random activation according to (5.5). In
this case, too, we can aim at minimizing a relative entropy with some steepest
descent technique. Details of the algorithm can be found in [Apolloni et al.,
1991b]. Such a network learns faster and with a high percentage of success.
It exhibits no prejudice and allows some learning refinement even after many
training cycles (an attitude opposite to weight sclerotization).


5.4.2.4 Exploring the search space in parallel 


Computing the cost function gradient poses no great trouble per se. But if
the cost is a complex function of the examples, i.e. if the neural network computing
the goal output is complex, sloping down along the gradient direction leads
inevitably to a local minimum. Random kicks are the most common ingredient
for supplying escape routes from these minima. However, they simply relieve the
above drawback without removing it. Thus, even in the Boltzmann machine,
where each network activity is random by definition, problems such as how to
dimension the learning rate or when to stop the training process persist. A
more structural escape route has recently been pursued by ensemble methods.
The general strategy lies in a multistart procedure for minimizing
a simple cost function, i.e. one related to a simple neural network. We have
already seen, for instance, that if the network is a linearly-activated perceptron,
then gradient descent methods ensure approaching the unique (hence global)
minimum in a few iterations. Of course with this elementary network you may
compute only hyperplanes, which may fit or separate satisfactorily only small




Fig. 5.24: Training a 2-2-1 feed-forward network to learn the XOR function. Percent- 
age of higher trajectories: 92%. Same notation as in Fig. 5.23. 


subsets of your training set. Hence you need more than one of these
hyperplanes (the reason behind the multistarting) and assign to another agent,
maybe another perceptron, the job of best exploiting the outputs of the trained
neural networks. See Fig. 5.25 for the general scheme. You may read in these
terms the three-layer architecture in Fig. 5.11 that learns the XOR function
and the most general one allowing the approximation of any recursive function
as certified by Theorem 5.1. The novelty lies in the training of its
parameters. In principle you may decide not to train the single elementary networks,
leaving them with their random initialization, and just train the supervisor agent to
suitably mix their outputs. This is again a certain sampling mechanism where
the elementary networks supply a seed randomly depending on the input, and
the supervisor learns the explaining function. A strategy still with random
preponderance, called bagging [Breiman, 1986], trains the single networks on
randomly perturbed versions of the training set, thus easing the work of the
supervisor. Finally, a less evidently randomly drawn strategy, called bootstrap
[Efron, 1982], works cascade style: start with the first network and mark the
examples that are not well learned, then continue with a second network whose
training privileges the learning of the marked examples, and so on.

We have plenty of both bagging and boosting strategies, all revolving, from
a conceptual point of view, around the bias-variance decomposition options,
sometimes read in terms of an analogous ambiguity decomposition.
The question is: what is our gain if we rely on the supervisor verdict in place
of the answer of one of the neural networks? Focusing on the usual quadratic
cost, we may use the usual equality:

$$\sum_{i} w_i (h_i(\tau) - c(\tau))^2 = \sum_{i} w_i (h_i(\tau) - h(\tau))^2 + (h(\tau) - c(\tau))^2 \qquad (5.50)$$

where $h$ is a convex combination of the $h_i$s, i.e. $h = \sum_i w_i h_i$ with $\sum_i w_i = 1$.
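
Equality (5.50) can be verified on any set of outputs. The following sketch
evaluates its three terms at a single input point; the function name and the
numbers are made up for illustration, and the weights are assumed to be
nonnegative and to sum to 1.

```python
def ambiguity_terms(outputs, weights, target):
    """Return the three terms of (5.50) for one input point."""
    h = sum(w * hi for w, hi in zip(weights, outputs))           # ensemble output
    weighted_err = sum(w * (hi - target) ** 2 for w, hi in zip(weights, outputs))
    ambiguity = sum(w * (hi - h) ** 2 for w, hi in zip(weights, outputs))
    ensemble_err = (h - target) ** 2
    return weighted_err, ambiguity, ensemble_err

# three base hypotheses answering 0.2, 0.7 and 0.9 where the goal value is 1.0
weighted_err, ambiguity, ensemble_err = ambiguity_terms(
    [0.2, 0.7, 0.9], [0.5, 0.3, 0.2], 1.0)
```

Since the ambiguity term is nonnegative, the error of the combination never
exceeds the weighted mean of the single errors, hence never the worst of them.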




Fig. 5.25: Exploring the search space in parallel through an ensemble of neural net- 
works. 


This equality ensures the advantage of using $h$ in place of the worst $h_i$, but
also its smaller efficiency with respect to around half of the $h_i$s used. In order to
consider the real benefit of $h$ we typically consider this other equality:

$$\mathrm{E}[(h(\tau) - C(\tau))^2] = (h(\tau) - \mathrm{E}[C(\tau)])^2 + \mathrm{E}[(C(\tau) - \mathrm{E}[C(\tau)])^2] \qquad (5.51)$$

which highlights two ingredients of the dispersion of the values computed by
the hypotheses about $c$ w.r.t. exactly $c$'s values. The first term represents the
square of the bias of our hypothesis, which we may appreciate either on each $\tau$
or along the whole function with a suitable metric. The second term represents
the variance of $C(\tau)$, which when possible we globally capture with confidence
intervals for the entire $C$ function. The two terms must be balanced in order
to avoid wasting computational effort. This is why, with complex functions
denoting high variance, it is reasonable to employ simple perceptrons as base
learners, with the benefit of a small computational effort and a bias
term that is comparable with the $C$ variance. Note that in the Kolmogorov
framework we get similar conclusions, but by another means. Namely, for a
random variable $H$ and fixed $c$, (5.51) reads:

$$\mathrm{E}[(H(\tau) - c(\tau))^2] = \mathrm{E}[(H(\tau) - \mathrm{E}[H(\tau)])^2] + (\mathrm{E}[H(\tau)] - c(\tau))^2 \qquad (5.52)$$

Now the first term represents a variance and the second term the square of
a bias. Moreover, the two terms are normally in competition: you lower the
former with the consequence of increasing the latter, and vice versa, so that we
look for a suitable break-even point. With reference to the instance previously
considered, the use of perceptrons is now justified by their low learning
complexity, hence low variance, which compensates for the roughness of the resulting $H$,
hence its relatively high bias w.r.t. $c$.
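
For completeness, the split into a variance and a squared bias follows from a
one-line expansion. The argument $\tau$ is dropped for brevity; the derivation
only uses the linearity of the expectation and the fact that $c$ is fixed.

```latex
\begin{align*}
\mathrm{E}[(H-c)^2]
  &= \mathrm{E}\big[\big((H-\mathrm{E}[H]) + (\mathrm{E}[H]-c)\big)^2\big] \\
  &= \mathrm{E}[(H-\mathrm{E}[H])^2]
     + 2\,(\mathrm{E}[H]-c)\,\mathrm{E}[H-\mathrm{E}[H]]
     + (\mathrm{E}[H]-c)^2 \\
  &= \mathrm{E}[(H-\mathrm{E}[H])^2] + (\mathrm{E}[H]-c)^2 ,
\end{align*}
```

since $\mathrm{E}[H-\mathrm{E}[H]] = 0$ makes the cross term vanish. The same
expansion, with the roles of the fixed and the random quantity exchanged, gives
the decomposition written for a fixed $h$ and a random $C$.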




5.4.3 Dimensioning training and test sets 


In the absence of any information about the class of functions we want to learn,
questions about the size of the training set look hopeless. Indeed a common
strategy is to check a posteriori the adequacy of the training set by testing the
generalization capability of the trained network on another set of data. This
raises the general problem of deciding, given the available set of
examples, how many to devote to training the network and how many to testing it.
In this section we address two points: broadly estimating the complexity of the
class learnt by a neural network, and suitably balancing the size of training and
test sets.


5.4.3.1 Class complexity 


Complexity indices such as detail or Vapnik-Chervonenkis dimensions (see Def- 
initions 3.4 and 3.8) have been defined expressly for Boolean functions. Al- 
though some extensions of Vapnik-Chervonenkis and analogous indices such as 
Rademacher complexity [Bartlett and Mendelson, 2002] are employed in the lit- 
erature, we prefer to start directly from Boolean functions for making guesses 
about real functions too. Consider an on-line learning procedure, where we 
add a new point to the training set once the learning algorithm has succeeded 
in computing a hypothesis consistent with the current set of examples. As in 
the PerceLearning algorithm, we will do nothing if the new point is correctly 
labeled by the current hypothesis; otherwise we search for a new hypothesis 
consistent with the augmented training set. In the latter case we also update 
a support vector contained in a subset of the current training set necessary for 
computing a hypothesis consistent with the whole set. The following fact holds: 


Fact 5.2. In the above online learning procedure: 


1. an upper bound estimate of $D_{C,H}$ is given by the number of examples
requiring an iteration of the learning procedure;


2. a lower bound to $D_{C,H}$ is represented by the maximum of the support
vector cardinalities.


Check: Each sample point requiring an update of the current hypothesis 
witnesses a consistency violation on the part of the latter. This denotes a 
symmetric difference of the current hypothesis with the concept lying in an 
analogous region generated by the updated hypothesis, hence a region to be 
sentineled exactly by the above point. We have no certainty that the point will 
maintain this witnessing role also in the future updated hypotheses, or that the 
set of such points collected during the training story is minimal in relation to this 
functionality. Therefore we speak of upper bound. From another perspective, 
since we take this set only on one of the possible stories, we speak of upper bound 
estimate. For the same reason we speak of lower bound when we check on each 
current hypothesis the minimal number of points necessary for determining a 
consistent hypothesis. 
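
The bookkeeping behind the first item of Fact 5.2 can be sketched as follows.
A plain perceptron is used here only as a stand-in for the consistent learner,
the data are assumed linearly separable so that retraining terminates, and the
function names are hypothetical; the support vector of the second item is not
tracked.

```python
def margin(w, x, y):
    """Signed margin y * (w.x + b); the bias b is stored as the last entry of w."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

def consistent_perceptron(examples, max_epochs=1000):
    """Retrain from scratch until every example has a positive margin."""
    w = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:                    # labels y are -1 or +1
            if margin(w, x, y) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)] + [w[-1] + y]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no consistent hypothesis found within max_epochs")

def online_learning(stream):
    """Add one point at a time; count the points that force a new hypothesis."""
    seen, updates = [], 0
    w = [0.0] * (len(stream[0][0]) + 1)
    for x, y in stream:
        seen.append((x, y))
        if margin(w, x, y) <= 0:                 # current hypothesis is violated
            updates += 1
            w = consistent_perceptron(seen)
    return w, updates
```

The returned count depends on the order in which the points arrive, which is
why it can only be read as an estimate taken on one of the possible stories.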




The above arguments capture the extreme weakness of the sentences in Fact 
5.2. They just work for giving a preliminary idea of the class complexity. Within 
these limits we may extend the complexity guess even to non Boolean functions. 
As mentioned earlier, here the consistency notion smooths into the one of satis- 
factory closeness of sample points to the graph of the hypothesized underlying 
function. In this case, too, we are driven to update the hypothesis when a point 
falls too far from the graph and we do not consider it an outlier. Thus you may 
keep the account of these innovation points as for the Boolean functions. 


5.4.3.2 Managing the available examples wisely 


As mentioned elsewhere, a dual way of considering sentry points is to analyze
the degrees of freedom of a labeled sample with respect to a computed hypothesis.
Actually, if in a sample of size $m$, $\mu$ points are used by the hypothesis (or rather
by its symmetric difference with a concept) to bind its expansion, only the
remaining $m - \mu$ points are free to lie either inside or outside the concept in
a random way. Thus we say that in this case $m - \mu$ represents the number of
degrees of freedom of the sample. With respect to this parameter we have that:


Fact 5.3. Once h is fixed, if we check it on another sample of size n, the degrees 
of freedom are now exactly n. Namely, if h is consistent with the whole test set, 
then: 


$F_{U_{c \div h}}(\varepsilon) \geq I_\varepsilon(1, n)$    (5.53)


Inequality (5.53) has the following operational meaning. If the learner does not
know $D_{C,C}$, the accuracy parameters ε and δ have for her/him a meaning
more fuzzy than probabilistic. Take a sample size m large enough that, for
instance, $I_\varepsilon(1, m) = 1 - \delta$, with ε and δ low enough: if the learning algorithm
produces a hypothesis h consistent with the sample we can be satisfied. Our
only doubt is that the actual degrees of freedom are much lower than m, or
that an unfair sample was drawn, so that a perception such as “zero errors over
m” is misleading. But if we test our hypothesis on n new examples and count
zero errors again, the degrees of freedom are exactly n and we are sure of our
perception. More formally, (0, ε) is the confidence interval for the above measure
at confidence level $I_\varepsilon(1, m)$. As acknowledged in Fig. 5.7, an excessive $D_{C,H}$ is
at the basis of the overfitting phenomenon. Vice versa, if our hypothesis is
based on a poor sample we cannot expect any improvement from testing. This
is the core of our claim.
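The operational reading above can be made numeric. The sketch below assumes that $I_\varepsilon(a, b)$ denotes the regularized incomplete beta function, for which $I_\varepsilon(1, n) = 1 - (1 - \varepsilon)^n$; the function names are illustrative and do not come from the text.

```python
import math

def confidence_zero_errors(eps: float, n: int) -> float:
    """I_eps(1, n) = 1 - (1 - eps)**n: the lower bound on the confidence that
    the error probability is below eps after zero errors on n test examples."""
    return 1.0 - (1.0 - eps) ** n

def min_test_size(eps: float, delta: float) -> int:
    """Smallest n such that I_eps(1, n) >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

# Hypothetical accuracy target: error below 5% with confidence 99%.
n = min_test_size(eps=0.05, delta=0.01)
print(n, confidence_zero_errors(0.05, n))
```

Under these assumptions the required test size depends only on (ε, δ), not on the complexity of the class, which is the point made by Fact 5.3.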


Theorem 5.3. For a concept class C on $\mathfrak{X}$ and any (known or unknown)
accuracy target (ε, δ), the ratio r between the cardinality n of the test set and m
of the training set for achieving the target both in training and in testing is at
most 1. In particular, in the simplifying assumption that the cardinality of the
sentry sets equals $D_{C,C}$ for every symmetric difference and sentry function, if
no details on the $\mathfrak{X}$ probability distribution are available, then, for $D_{C,C} = \mu$ we


1 
have r < TE: 




Fig. 5.26: Batch updating: the examples’ total contribution leads to a modification 
of the computed hypothesis in (a), while sum to 0 in (b), leaving the hypothesis 
unchanged. 


Proof. If we have no indication of the $\mathfrak{X}$ probability distribution, inversion of
equations (5.53) and (3.22) gives the proof. Of course, in the case of almost
consistent hypotheses the claim holds if we accept the same number of errors
both in training and in generalization. In this case we extend relation (5.53)
in the same way we pass from (3.15) to (3.22). If details on the distribution
law are available, we can use them to spare some sentry points. This has the
effect of pushing the ratio between n and m closer to 1, yet always from
below. □


As in the previous section, we may extend this upper bound on r to classes of
non-Boolean functions too. We do so by referring the sample size to the number
of innovation points requiring an update of the current hypothesis, whereas for
the test set we assume that a single point suffices to witness a correct twisting
argument relation, in analogy to (2.14). Of course this gives only rough guidance,
and only in the case that the inner learning mechanism is progressive, as in the
on-line learning procedure outlined in the previous section.


5.4.4 Piecewise building of the pivotal statistics 


Once we have decided the number of examples, a second choice concerns how
they will be supplied to the training system. Most weight
updating rules sum the contributions coming from each example. For instance, if
we aim to minimize an MSE such as (5.15), an addend comes to the updating rule
(5.37) from the error registered on each example. Now, in cases like Fig. 5.26(a),
where we are learning a linear function through motions locally applied to the
current hypothesis in proportion to the errors, the sum of motions impresses
on the dashed line a translation plus a rotation toward the target function
(represented by a continuous line). In the case of Fig. 5.26(b), however, similar
corrections sum to 0, thus keeping the hypothesis in an MSE local minimum.
Actually, the parameter update computed by the training algorithms considered
in this chapter produces corrections smarter than linear ones.


Each example
adds a
contribution to
the gradient;


how many terms 
is it suitable to 
add before 
correcting 
weights? 


Empirical answers 


Only local blobs 


are sensitive to 
local inputs in a 


Boltzmann 
machine 


Thus, the awake 
phase is captured 


virtually during 


the dream 
evolution of a 
network 


possibly when 
locally its output 


coincides with 
the environment 
stimuli. 


Fig. 5.27: Course of mean error s versus number of iterations i in learning to mirror 4
bits through a 4-2-4 MLP, with varying batch size. Black curve: batch size = 16 (i.e.
batch learning); dark gray curve: batch size = 4; light gray curve: batch size = 1 (i.e.
on-line learning).


Symmetry conditions might, however, generate analogous local minimum
traps. We may avoid these by a different periodicity of the updating clock.
Between the two extremes, where we change the parameters either after
processing each example (which is called on-line learning) or after presentation of
the entire training set (denoted as batch mode), we may have a large variety of
batch sizes setting the cadence of the updates. As usual, we can compute an optimal
batch size in very specific learning tasks. But generally the updating clock is a
matter of “cooking”. In Fig. 5.27 we see the different behavior of the training
algorithm for learning a mirroring (see Example 5.10) with different batch sizes.
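The role of the updating clock can be illustrated on a much simpler task than the 4-2-4 MLP of Fig. 5.27. The following is a minimal sketch, assuming a linear model fitted by gradient descent on the MSE; the target function, learning rate and function names are hypothetical choices made only for illustration.

```python
import random

def train_linear(data, batch_size, epochs=200, lr=0.5, seed=0):
    """Fit y ~ a*x + b by gradient descent on the MSE, correcting the
    parameters once every `batch_size` examples."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Sum the contributions of the examples in the batch ...
            grad_a = sum((a * x + b - y) * x for x, y in batch)
            grad_b = sum((a * x + b - y) for x, y in batch)
            # ... and only then correct the parameters.
            a -= lr * grad_a / len(batch)
            b -= lr * grad_b / len(batch)
    return a, b

def mse(data, a, b):
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

# Hypothetical noiseless target y = 2x - 1 sampled at 16 points in [0, 1].
points = [(i / 15, 2 * i / 15 - 1) for i in range(16)]
for size in (1, 4, 16):  # on-line, intermediate, batch
    a, b = train_linear(points, size)
    print(size, mse(points, a, b))
```

With the same number of passes over the data, a smaller batch size means more parameter corrections; which cadence works best remains task dependent, as the text remarks.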


A special instance of identification of the batch size occurs with Boltzmann
machines. As mentioned before, the main problem for a computational use of
the Boltzmann machine is the time wasted in reaching stationary distributions. To
overcome this problem we may consider batch sizes smaller than a single example.
Namely, returning to the model in Fig. 5.5, under the mentioned assumption
that connections from neurons outside local blobs just induce thermal noise,
we can approximately set to zero the weights of these connections and locally
identify blobs with a set of approximations of the Boltzmann machine whose
only thermality is conditioned by the surrounding neurons. This constitutes the
rationale for the following piecewise learning procedure.

To estimate the expected values in (5.41) we launch the network in its 
stochastic evolution. Once we have assumed that the process is close to equilib- 
rium we start computing our statistics. Namely, we estimate the expected values 
Er and Ep through sample means, where the sampled values of T;(t+1)T;(t) are 
collected for Ey at each state updating. Instead, the values for Ey are collected 
only when, by the way, the environmental state vector matches the sampled 


Fig. 5.28: Boltzmann Machine learning to solve a MIRRORING problem, with input,
hidden and output layers. The i-th output neuron must reproduce the state of the
i-th input neuron.


environment variable. To avoid waiting too long for a match, we accept par- 
tial matchings as well, collecting EK, statistics only from the blob surrounding 
the matching neurons. The mentioned thermality of contiguous regions induces 
trajectories that bend around the steepest descent direction. But if we only 
accept matches affecting a significant part of the visible belt, this swinging is 
not a drawback. Moreover, as seen before, connections which are almost always 
updated — since they share many blobs — pursue the sole task of decreasing the 
autocorrelations of the related pair of nodes. This reduces their role to that 
of thermal nodes. In this way we sight a twofold objective: i) on one side to 
render E[T;(t)T; (t+ 1)] far from E[T;(¢ + n)T;(t + n+ 1)], ii) on the other side 
to avoid the general plague of reinforcement learning procedures, namely
their inner instability: once the net learns a pattern, it reinforces it, possibly
abandoning the pursuit of the other items of the training set. The weight
updating is asynchronous. Once a connection of the structured or visible belt
has collected enough samples from the external stimuli it updates its weight.
Boundary conditions on the parameter update rate which allow some
pseudoergodicity theorems to hold have been studied in the past [Sussmann, 1988]. Here
we assume that the small asynchronous weight updates induced by our learning
strategy meet similar constraints, so that the Boltzmann process remains close
to equilibrium behavior throughout the learning process.


Example 5.10. Consider the MIRRORING problem: reproduce on a set of 
visible neurons the same state vector as another set of visible neurons under the 
constraint that no direct connection exists between the two sets. 

We may implement the task on the architecture in Fig. 5.28. Here the 
thermal core is virtual, in the sense that it is represented by the neurons not 
involved in a local mirroring. Moreover, the notion of surroundings is quite 
natural: the hidden nodes represent the sole surroundings of each visible neuron. 

The learning procedure runs as outlined in Algorithm 5.6, in particular: 


• initialize() randomly sets the initial weights and thresholds;


Hence local 
corrections too, 
with remaining 
neurons acting as 
thermal noise 
generator 


An elementary 
task: repeat 
what you sought 


Walk 


observe 


compare 


Algorithm 5.6 By the way learning.


by_the_way_learning(lay-out, threshold) {
    external stopping_rule();
    external matching_criterion();
    external reach_equilibrium();
    initialize();                        /* random weights and thresholds */
    reach_equilibrium();
    while (stopping_rule() == false) {
        example();                       /* draw an example from the environment */
        do until (matching_criterion() == false) {
            walking();                   /* one step of the parallel dynamics */
            statistics-();
        }
        statistics+();
        if (num. of positive statistics accumulated
            on connection (i,j) > threshold)
            update(i,j);
    }
}


example() draws an example (r”,7”) from the environment distribution 
law y(t”). Following [Hinton and Sejnowski, 1987] we may adopt the 
probability measure equally distributed over vectors with only one com- 


ponent set to 1. Namely, for a given length n of the vector to be mirrored: 


$\gamma(\tau^{v}) = \begin{cases} \frac{1}{n} & \text{if } \sum_i \tau^{v}_i = 1 \\ 0 & \text{otherwise} \end{cases}$    (5.54)


This is a simple but absolutely not trivial learning instance, as we will 
show later on; 


• walking() moves the simulation of the parallel dynamics (5.38) a step
ahead. A distinguishing feature of our algorithm is that the machine
evolves at a constant temperature, a few exceptions aside;


• statistics-() adds $\tau_i(t-1)\tau_j(t) + \tau_i(t)\tau_j(t+1)$ to the register $Cm^-_{ij}$ and
1 to $n^-_{ij}$ for each i, j;


• statistics+() subtracts $\tau_i(t-1)\tau_j(t) + \tau_i(t)\tau_j(t+1)$ from $Cm^-_{ij}$ and adds
it to $Cm^+_{ij}$, subtracts 1 from $n^-_{ij}$ and adds 1 to $n^+_{ij}$ for each connection (i, j)
in the surrounding of matching neurons;




Fig. 5.29: The network loses one or two examples out of three. Same notation as in
Fig. 5.23. Y axis: number of lost examples.


• update(i, j) sets $w_{ij} = w_{ij} + \beta \left( \frac{Cm^+_{ij}}{n^+_{ij}} - \frac{Cm^-_{ij}}{n^-_{ij}} \right)$ and sets both $n^+_{ij}$ and
$n^-_{ij}$ to 0.
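The bookkeeping performed by statistics-(), statistics+() and update(i, j) on a single connection can be sketched as follows. This is only an illustration under stated assumptions: the class and attribute names are hypothetical, the correlation sample is passed in as a number rather than computed from network states, and resetting the two sums together with the two counters is an assumption not stated in the text.

```python
class ConnectionRegisters:
    """Registers kept for one connection (i, j): sums Cm-, Cm+ and counts n-, n+."""

    def __init__(self, weight=0.0):
        self.weight = weight
        self.cm_minus = 0.0
        self.cm_plus = 0.0
        self.n_minus = 0
        self.n_plus = 0

    def statistics_minus(self, corr):
        """Record a correlation sample observed at a generic walking step."""
        self.cm_minus += corr
        self.n_minus += 1

    def statistics_plus(self, corr):
        """Move a sample from the '-' registers to the '+' registers, because
        the state it came from matched the environment."""
        self.cm_minus -= corr
        self.n_minus -= 1
        self.cm_plus += corr
        self.n_plus += 1

    def update(self, beta):
        """w <- w + beta * (Cm+/n+ - Cm-/n-), then restart the counters."""
        if self.n_plus == 0 or self.n_minus == 0:
            return  # one side has no samples yet: skip the update
        self.weight += beta * (self.cm_plus / self.n_plus
                               - self.cm_minus / self.n_minus)
        # Assumption: the sums are cleared together with the counts.
        self.cm_plus = self.cm_minus = 0.0
        self.n_plus = self.n_minus = 0


reg = ConnectionRegisters()
for corr in (1.0, -1.0, 1.0):   # three unmatched walking steps
    reg.statistics_minus(corr)
reg.statistics_minus(1.0)       # a fourth step ...
reg.statistics_plus(1.0)        # ... which turns out to match the environment
reg.update(beta=0.1)
print(reg.weight)
```

The weight moves in the direction of the difference between the mean correlation observed on matching states and the mean observed elsewhere.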


stopping_rule() and matching_criterion() characterize the various issues
of the learning strategy. The general aim is to avoid getting stuck in relative
minima of I(w). This happens mainly when some connections dominate the
others, definitely biasing the network to reproduce only one subset of the
training set. It arises when the network enters the following vicious loop: the network
learns a set of examples strongly, then it passes through these examples with a
high frequency, thus strengthening more and more some connections that mainly
bias the signals in input to the various neurons in correspondence with these
examples. In particular, since the Hamming distance between the elements of
the training set is 2, if we relax the matching conditions we continue to strengthen
the above connections even when we virtually clamp, through a partial matching
on all zeroes, the network on what are actually other examples. This is the typical
picture for networks that lose one or two examples out of three (see Fig. 5.29).


In this sense matching_criterion() is a pair of nested rules for both
defining the acceptable matches and devising the unbiasing tools. The key
points are the following:


• to reduce the time necessary for a complete matching, each partial
matching is accepted;


• to avoid bias, the learning is incremental. We first train a pair of input and
corresponding output nodes connected by a hidden one. Then
we duplicate the structure, sharing the hidden node with the replica and
adding a new hidden node for easing the fusion. This is the configuration
in Fig. 5.28. Once the new structure has been trained we reiterate the
procedure: duplication plus sharing of the hidden nodes, and addition of


update 


require many 
sagacities 




Fig. 5.30: By the way learning of a mirroring on 32-bit orthogonal vectors, through a
32-5-32 Boltzmann machine. Same notation as in Fig. 5.23. Y axis: percentages.


a further hidden node. To avoid biases in the trained structure the hidden
nodes have thresholds frozen to 0.

Feasible spots.

Fig. 5.30 shows the firm attainment of the zero target for the error frequency
recorded as in Fig. 5.23. The learning problem faced, though difficult per se,
is so strictly structured as to allow the above elegant incremental method. Other
strategies have to be devised for more chaotic environment distributions.


5.4.5 Hybrid learning systems 


Use everything
but be concise

Returning to the maximum likelihood principle implemented with the upper
bound (5.9), the broad commitment is to get a compressed description of the
training set. We may achieve this target through either subsymbolic or
symbolic data description. The general criterion may be that, given a computational
framework (machine, language, ...) for suitably describing a function on a
training set, a formula beats a neural network if its description length (including
observed statistics for free parameters) is shorter than the neural network’s.


Example 5.11. According to the above criterion: 


1. The symbolic description of the XOR function, for instance through the
formula


$1 - x_1 x_2 - (1 - x_1)(1 - x_2)$    (5.55)


beats its description through a neural network described by a 2-2-1
MLP, namely


$\sigma(-6.853\,\sigma(-5.646 x_1 - 3.433 x_2 + 1.794) + 6.663\,\sigma(-3.433 x_1 - 3.434 x_2 + 4.770) - 2.927)$    (5.56)


Fig. 5.31: Balancing the data explanation. Labels: Neural network; Symbolic processing.


where σ denotes a sigmoidal activation function, learnt from the sample
{(1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 0)} through the usual back-propagation
algorithm;


2. In classifying emotions in a phonetic database, we observed a C4.5 decision
tree [Quinlan, 1993] consisting of 64 if-then-else rules on 74 features
beaten by a Support Vector Machine with linear kernel on the same
variables [Fellenz et al., 2000].
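The two descriptions of XOR in item 1 can be compared directly. The sketch below evaluates the formula (5.55) and a 2-2-1 MLP with the coefficients printed in (5.56); the assignment of each coefficient to $x_1$ or $x_2$, and the use of the logistic function for σ, are assumptions made for illustration.

```python
import math

def sigma(z):
    """Logistic sigmoid, assumed here as the activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def xor_formula(x1, x2):
    """Symbolic description (5.55)."""
    return 1 - x1 * x2 - (1 - x1) * (1 - x2)

def xor_mlp(x1, x2):
    """2-2-1 MLP with the coefficients read from (5.56)."""
    h1 = sigma(-5.646 * x1 - 3.433 * x2 + 1.794)
    h2 = sigma(-3.433 * x1 - 3.434 * x2 + 4.770)
    return sigma(-6.853 * h1 + 6.663 * h2 - 2.927)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_formula(x1, x2), round(xor_mlp(x1, x2), 3))
```

The formula needs no numerical parameters at all, whereas the network needs nine real-valued ones to approximate the same four points, which is the sense in which the symbolic description is shorter.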


We may also adopt an intermediate strategy relying on a mix of symbolic 
and subsymbolic processing elements. We explained this in Sec. 5.1, where the 
former account for what we formally know about the string sampling mechanism, 
and the subsymbolic part is committed to supplying what is still unknown. 
This subdivision of data processing attitudes is also reminiscent of how the 
human brain behaves. We may divide the journey from sensory data to symbols 
(or, more evocatively, “from synapses to rules”) into a connectionist and a 
symbolic part whose relative prominence is shown in Fig. 5.31 by the cursor 
at a point depending purely on the learning task at hand. The fusion point 
between the two parts consists in the symbols that are computed from the data 
by a neural network, passed to the Turing Machine, and used by the latter as 
propositional variables of Boolean formulas that must explain the data. Thus 
we may envisage a global learning process which has data features as input, 
symbolic rules as output, and the meaning of these symbols as an additional 
benefit. Its host may be a layered structure where the first part identifies with a 
neural network producing propositional variables and the remaining layers are 
Boolean gate arrays grouping these symbols in formulas of increasing structural 
complexity (see Fig. 5.32). The latter layers manage either disjunctions or 
conjunctions of properties found with similar operations on properties at a lower 
level, having as a starting point the features’ properties (denoted by the meaning 
of the symbols). From an informational viewpoint, the signal flows left to right 
evolving through increasingly compressed representations where the information 
is distributed in both the states of the processing elements and their connections. 
The training of the first part lies in the province of the algorithm studied in this 
chapter, while the second part may be learnt with symbolic procedures like those 
studied in Chapter 3. 


A MLP flows 
data from 
subsymbolic to 
symbolic PE’s 


Learning a 
function with no 
examples of its 
output 
(unsupervised 
learning) 


at least for 
recognizing past 
stories. 




Fig. 5.32: Extended neural network. Circles and squares denote respectively neural 
and symbolic units. Arrow: data flow direction; dashed line: cross section between 
connectionist and symbolic processing. 


Wisely balancing symbolic and subsymbolic data explanation is a complex
and sophisticated task. It also requires a careful application of almost all
the techniques we studied in this book. The reader may find a complete example
in [Apolloni et al., 2004a].


5.5 Learning from no teacher 


The last step we consider in the vanishing of available information is represented 
by the scenario where even the labels of the examples are unknown. Adam has 
no clear idea yet of how his body really reacts to the food he eats. But in order 
to survive he needs to discern different foods and body reactions. This is not 
very unlike what happens when we want to understand which kind of e-mail 
might be useful to us, apart from get-rich-quick schemes or other joys in life we 
cannot mention in an academic book. 


5.5.1  Self-associative memories 


A preliminary ability we must secure is that of distinguishing an example
from the ones we have already seen. As a primordial conceptual
activity this can be done through a very simple mathematical tool that we can
imagine implemented on a set of fully interconnected linear threshold neurons
as follows. Consider a binary vector $\xi = (\xi_1, \ldots, \xi_\nu)$ with $\xi_i \in \{-1, 1\}$ ²³ and
the following rule that fixes the connection weights:


$w_{ij} = \frac{1}{\nu} \xi_i \xi_j$    (5.57)

We call it imprinting mechanism since a network which received these weights
recognizes the originating example within a set of ξ’s processed through the
activation function $h_i(\xi)$ as follows:


$h_i(\xi) = \begin{cases} 1 & \text{if } \sum_{j=1}^{\nu} w_{ij}\xi_j \geq 0 \\ -1 & \text{otherwise} \end{cases} = \mathrm{sign}\left( \sum_{j=1}^{\nu} w_{ij}\xi_j \right)$    (5.58)

With the original example we have that the state ξ′ produced by the network
coincides with it. In fact we can easily recognize that for all i


$\mathrm{sign}\left( \sum_{j=1}^{\nu} \left( \frac{1}{\nu}\xi_i\xi_j \right) \xi_j \right) = \xi_i$    (5.59)


When we accumulate new imprintings on the connections, due for instance to
the vectors $\xi^1, \ldots, \xi^m$, the generic weight becomes


$w_{ij} = \frac{1}{\nu} \sum_{k=1}^{m} \xi_i^k \xi_j^k$    (5.60)


and if vector $\xi^h$ (with h ∈ {1,...,m}) is processed by this network, the output
$\xi'_i$ of neuron i now reads


$\xi'_i = \mathrm{sign}\left( \sum_{j=1}^{\nu} \left( \frac{1}{\nu} \sum_{k=1}^{m} \xi_i^k \xi_j^k \right) \xi_j^h \right) = \mathrm{sign}\left( \xi_i^h + \frac{1}{\nu} \sum_{j=1}^{\nu} \sum_{k \neq h} \xi_i^k \xi_j^k \xi_j^h \right)$    (5.61)


The sum in the brackets of the last term proves to be a noisy addend we call
crosstalk. It does not affect the recognition if its absolute value is less than 1;
otherwise it may be $\xi'_i \neq \xi_i^h$. The ineffectiveness of crosstalks is ensured when
all vectors are mutually orthogonal, in the sense that $\sum_{j=1}^{\nu} \xi_j^h \xi_j^k = 0$ for all pairs
h, k with h ≠ k. Since we can have at most ν orthogonal vectors of length ν,
this quantity represents an upper bound to the number of patterns that can be
stored by the neural network, i.e. its capacity²⁴. Otherwise we can rely on the
confidence with which the crosstalk falls in the interval [−1, 1]. For instance,
under the hypothesis that the addends in the crosstalk are independent, so that we
can approximately consider its distribution law Gaussian, with confidence 0.9900
we can store a number of patterns up to lnv. Moreover a greater crosstalk
may be a minor drawback if it occurs in a limited number of units. This


²³ We use again this different normalization since it follows the usual notation of the
physicists who first developed this topic.

²⁴ Actually not a heavy upper bound if referred to the neural network in our brain, composed
of around $10^{11}$ neurons.

It is a trivial task 
if the story is 
unique 


but we may use 
the same solution 
for more complex 
stories 


provided they are 
a sequence of 


orthogonal 
vectors. 


With moderate 
non- 
orthogonalities 
we may store a 
lot of pictures in 
our mind 


Analogous results 
with stories 
similar to the 
past ones 


Even better, let 
recognize 


similarities in the 
stories 


The strategy is 
to have a
partition of the 
pattern space 


highly penalizing 
misplacements. 


This corresponds 
to selecting the 
class with
minimal loss 
function value 


In case of 
quadratic loss the 
class center has 
the mean 
coordinates of 
the patterns in 
the class 




is comparable to recognizing the picture of a face even if a limited number of 
pixels has been corrupted over time. 
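The imprinting rule and the recognition step can be tried on a toy case. The sketch below stores three mutually orthogonal vectors of length 8 with the accumulated weights of (5.60) and applies one step of the sign activation (5.58); the function names and the chosen patterns are illustrative assumptions.

```python
def imprint(patterns):
    """Weights w_ij = (1/nu) * sum_k xi_i^k * xi_j^k, as in (5.60)."""
    nu = len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) / nu for j in range(nu)]
            for i in range(nu)]

def recall(weights, xi):
    """One synchronous step of the sign activation (5.58); sign(0) is taken as +1."""
    return [1 if sum(w * x for w, x in zip(row, xi)) >= 0 else -1
            for row in weights]

# Three mutually orthogonal vectors of length 8.
patterns = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
w = imprint(patterns)

# Each stored vector is reproduced: the crosstalk vanishes by orthogonality.
stored_ok = all(recall(w, p) == p for p in patterns)

# A vector with one corrupted component is restored to the stored one.
corrupted = list(patterns[1])
corrupted[0] = -corrupted[0]
restored = recall(w, corrupted)
print(stored_ok, restored == patterns[1])
```

The second check mirrors the remark on corrupted pixels: a limited number of flipped components leaves the sign of each unit's input unchanged in this example.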


5.5.2 Self-organizing memories 


Once we know how to recognize what is different, the subsequent question
is: “different from what?”. The answer requires an ability to see similarities
between patterns, enabling us to recognize clusters of patterns. They become
classes once we acknowledge their utility and reward them for this by giving
them a name. Everybody knows what rice is, since it is a suitable and desirable
food all over the world. On the contrary, we have as yet no commonly accepted
classification of unsuitable e-mails. Thus we may identify the problem of
establishing clusters of patterns with that of minimizing the cost of an incorrect
attribution of patterns to clusters, i.e. of misclassification once we decide to
actually use clusters as classes. Formally, let the number of classes be m; a
pattern y from a set Y may be attributed to one class within
the class set $\{d_1, \ldots, d_m\}$. We denote by $d_i$ the i-th class, by $D_i$ the decision of
attributing y to it, and by $l(D_i, d_j)$ the comparative cost of attributing y to $d_i$
in place of $d_j$. The problem consists in partitioning Y into m subsets through
a decision rule D(y) such that


$D = \arg\max \sum_{y \in Y} \sum_{i=1}^{m} l(D(y), d_i)$    (5.62)


whose solution depends on the shape of l, i.e. we want to sharply discriminate
the clusters on the basis of the loss function. With this strategy the latter is not
an ethical punishment tool for a mistake you made; it is only a way to select the
class to which the pattern at hand is efficiently attributed.


Example 5.12. Assume l connected to a Cartesian distance, namely


$l(D(y), d_j) = \begin{cases} 0 & \text{if } D(y) = d_j \\ (y - \mu_{d_j})^T (y - \mu_{d_j}) & \text{otherwise} \end{cases}$    (5.63)

where $\mu_{d_j}$ is a suitable center of class $d_j$. Then the solution is

$D(y) = \arg\min_{d_j} (y - \mu_{d_j})^T (y - \mu_{d_j})$    (5.64)


A suitable center should in turn minimize the same loss function within the
class. Namely


$\mu_{d_j} = \arg\min_{\mu} \sum_{y \in d_j} (y - \mu)^T (y - \mu)$    (5.65)

which, for an infinite Y, asymptotically reads

$\mu_{d_j} = \arg\min_{\mu} \{ E[(Y_j - \mu)^T (Y_j - \mu)] \}$    (5.66)




Fig. 5.33: The typical architecture of a Kohonen map: the lower layer is the set of 
input components, the upper layer is a set of neurons each indicating a different cluster 
of data. 


where $Y_j$ denotes the random variable describing points attributed to $d_j$. From
Sec. 2.3.1 we know that the solution to this problem is the unbiased estimator
of $E[Y_j]$. For instance, for $Y_j$ a Gaussian variable this estimator is represented
by the sample mean $\bar{Y}_j$.
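The two halves of the example, attributing each pattern to the closest center and taking the class sample mean as the new center, can be alternated as in the k-means algorithm recalled in the bibliographical notes. The sketch below is an illustration under that assumption; the data and function names are hypothetical.

```python
def sq_dist(y, mu):
    """Squared Cartesian distance (y - mu)^T (y - mu)."""
    return sum((a - b) ** 2 for a, b in zip(y, mu))

def assign(y, centers):
    """Decision rule (5.64): index of the class with the closest center."""
    return min(range(len(centers)), key=lambda j: sq_dist(y, centers[j]))

def class_means(patterns, centers):
    """Sample mean of the patterns attributed to each class, as in (5.65);
    a class that receives no pattern keeps its previous center."""
    groups = [[] for _ in centers]
    for y in patterns:
        groups[assign(y, centers)].append(y)
    return [[sum(col) / len(g) for col in zip(*g)] if g else list(c)
            for g, c in zip(groups, centers)]

# Hypothetical two-dimensional patterns and initial centers.
patterns = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
centers = [(1.0, 1.0), (3.0, 3.0)]
for _ in range(5):
    centers = class_means(patterns, centers)
print(centers)
```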


The above example supplies the theoretical basis of the famous Kohonen algo- 
rithm for generating self-organizing maps [Kohonen, 2001]. Let us consider a 
two-layer feed-forward network like in Fig. 5.33, where the input layer collects 
the patterns and the output layer devotes one neuron to each of m classes. The 
connection weights represent the coordinates of the center of the class associated 
to a node. Thus, starting with a random initialization of w, on each pattern y: 


1. we attribute y to the closest output node weight vector, according to
a suitable metric. For instance, if we adopt a Euclidean metric, we
attribute y to $i = \arg\min_i \{ \sum_j (y_j - w_{ij})^2 \}$. Another widely used metric is
the pseudogaussian norm


$d_i(y) = \exp\{ -(y - \mu_{d_i})^T (y - \mu_{d_i}) / 2\sigma_i^2 \}$    (5.67)


elsewhere called radial basis function [Poggio and Girosi, 1990a], whose
implementation requires the estimation of the $\sigma_i$ as well.


2. we update the class centers in view of making them close to the sample
mean of the annexed patterns as follows


$w_{ij} = w_{ij} + \eta (y_j - w_{ij}) V(y, i)$    (5.68)


where η is the learning rate, while V(y, i) = 1 if i is the class of y,
diminishing to 0 according to a selected notion of contiguity between the
classes. A typical shape of V is the Mexican hat in Fig. 5.34.
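The two steps above can be sketched for output nodes arranged on a chain. The neighborhood function used here (1 for the winning node, 0.5 for its immediate neighbors, 0 elsewhere) is a simplifying assumption and not the Mexican hat of Fig. 5.34; all names are hypothetical.

```python
def winner(y, w):
    """Step 1: index of the node whose weight vector is closest to y
    (squared Euclidean distance)."""
    return min(range(len(w)),
               key=lambda i: sum((yj - wij) ** 2 for yj, wij in zip(y, w[i])))

def neighborhood(i, win):
    """Hypothetical V: 1 for the winner, 0.5 for adjacent nodes, 0 elsewhere."""
    return {0: 1.0, 1: 0.5}.get(abs(i - win), 0.0)

def kohonen_step(y, w, eta):
    """Step 2: move the weights toward y according to update (5.68)."""
    win = winner(y, w)
    for i, row in enumerate(w):
        v = neighborhood(i, win)
        for j in range(len(row)):
            row[j] += eta * (y[j] - row[j]) * v
    return win

# Three nodes with one-dimensional weights and a single pattern.
w = [[0.0], [1.0], [2.0]]
win = kohonen_step([1.2], w, eta=0.5)
print(win, w)
```

The winning node moves most, and contiguous nodes are dragged along to a lesser degree, which is what makes neighboring nodes represent similar patterns.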


5.6 Bibliographical notes and further readings 


Subsymbolic learning is a wide-brimmed hat covering a large spectrum of data 
processing methods where the computation target is not explicitly related to 


A general 
neuromimetic 
algorithm 


possibly 
generating 
Kohonen maps 


possibly assessing 
radial basis 
functions 


just by shifting 
the center of the 
class toward the 
mean of the 
annexed patterns. 




Fig. 5.34: A Mexican hat shaped V in (5.68). 


the data features. Thus we gently modify the processing parameters on the 
basis of generic feedbacks about how the actual outputs fit or misfit the tar- 
get. Many of these methods aim to have a common template in the neural 
network learning paradigm, namely an initially random generator of functions 
molded to produce a suitable map from input data to the outputs matching the 
learning task. This is the case for instance of the clustering methods, where 
the Kohonen networks efficiently organize the computations of the classical k- 
means algorithm [Duda and Hart, 1973]. Analogously, data analysis tools like 
principal component analysis [Jolliffe, 1986] or independent component analysis 
[Hyvärinen et al., 2001] find effective implementation through either linear [Oja,
1989] or nonlinear neural networks [Hyvärinen and Oja, 1998]. 

The genuine application however remains the fitting of experimental data
through a nonlinear function. Without claiming exhaustiveness, we quote
some relevant differences between the various implementations and some
related papers the authors have read and so suggest to the reader. The
differences lie in: i) the kind of data (continuous / Boolean [Le Cun, 1985,
Jacobs et al., 1991], random / deterministic [Hinton and Sejnowski, 1987, Amari,
1998], supervised / unsupervised [Apolloni, 1992, Kohonen, 1989,
Carpenter and Grossberg, 1988, Roy et al., 1997]), ii) the kind of function
(discrete / continuous [Bishop, 1995, Haykin, 1994, Gersho and Gray, 1992],
deterministic / random [MacKay, 1992], static map / recurrent map / random
variable generator [Werbos, 1990, Williams and Zipser, 1989, Aarts and
Korst, 1989]), iii) the cost function driving the learning process (MSE /
likelihood / entropy / etc. [Peterson and Anderson, 1987, Paass, 1991, Amari
et al., 1992], with or without regularization terms [Poggio and Girosi, 1990b]),
iv) the learning strategy (reward and/or punishment [Sutton and Barto, 1998,
Apolloni and de Falco, 1990], gradient based / stochastic optimization
[Hecht-Nielsen, 1990], fixed / variable architecture [Fritzke, 1995, Reed, 1993], online
/ batch [Saad, 1998], with bagging or not [Breiman, 1986], from an ensemble of
networks or a single one [Jacobs et al., 1991]).

In this chapter we frame the roots of a large part of these implementations in 
the inferential paradigm of this book and other related theories with the aim of 
capturing their statistical rationale. In particular, inequality (5.9) is a corollary 




of the Levin equation [Levin, 1974] representing a main result of the Kolmogorov 
complexity theory that the reader may study in books like the one recalled 
in Chapter 1 [Li and Vitanyi, 1997]. This inequality suggests a direction for 
finding sufficient statistics in an almost totally unknown sample space. Within 
this space the most common cost functions, e.g. MSE and Kullback distance 
estimator, find their justification in empirical success (denoting general hidden 
properties of the sample spaces) rather than in strong theoretical results. 

We discuss the minimization methods of the cost functions only very briefly.
The reader may find a deeper discussion of the Metropolis algorithms in [van
Laarhoven and Aarts, 1987]. Instead, the derivation of back-propagation and the
Boltzmann machine basic learning algorithm are key topics in a sort of
preliminary bible of the connectionist paradigm constituted by the book [Rumelhart,
1986]. Their general form can be found, among others, in two papers by
one of the authors [Apolloni and Zoppis, 1999, Apolloni et al., 1991a]. The
reader may find a general discussion of the hybrid architectures in some
review papers [Hilario, 1997, Medsker, 1995, Dorffner, 1997, Wermter and Sun,
2000]. The discussion of the training and test set sizes is quite succinct, but not
many general results are available in the literature; see for instance [Apolloni,
1998, Baum and Haussler, 1989, Levin et al., 1989]. Boosting methods sketched
in Sec. 5.4.2.3 may represent a way out of the crucial problem of
finding meaningful (possibly sufficient) statistics. In principle you take completely
random functions as base statistics, and look for a simple and
suitable way of combining them (for instance through a majority voting [Freund and Schapire, 1995]) to
get the output of the network. This strategy is currently much explored within
the family of ensemble methods, with a special emphasis on the bias-variance
problem of balancing the roughness of the base statistics with the simplicity of
their learning [Breiman, 1986].

Finally, in the last section we bring up the problem of unsupervised learn- 
ing. With reference to the perspective of learning as a way of delving deeper 
into the explaining function of a training set, mentioned in Sec. 3.7, here we 
assume that the explaining function is alternately picked from a set of func- 
tions. Each function is associated to a cluster of data. The problem of finding 
the best discriminating set of functions is dealt with in this book in an original 
way. Usually books frame the problem in terms of conditional distributions (see 
[Devroye et al., 1996]). Instead, we address the problems basically in terms of 
logical conditions, where the solution may draw an advantage from knowledge 
of the training set’s probabilistic features. A deep discussion of the related algo- 
rithms can be found in the literature. Specifically the Kohonen algorithms are 
widely discussed in [Kohonen, 2001], while Radial Basis functions are treated 
in [Poggio and Girosi, 1990a]. 

Having given a learning capability to your PC, you may expect it (him?)
to be very close to reasoning and deciding by itself. This is the wide field of
Intelligent Systems, where crucial problems such as data mining [Pedrycz, 2001],
industrial/economical competitions [Baba and Jain, 2001] or, not least, spam
filters for your mailbox seek solutions. We expect that framing these problems
in the algorithmic inference framework as well will yield interesting solutions. In
very essential terms, we pass from a mental scheme where God fixed the rules
and then tosses dice to give us feeble insights, to an operational framework
where rules are random functions that we have tools to manage [Apolloni et al.,
2002b] to hit our own target constituting the temporary unquestionably place of
the universe [Copernicus, 1543].


Part III


Appendices 




A — Combinatorial calculus 


A.1 Managing configurations 


n objects to fill k places

In the typical scenario of combinatorial calculus n objects and k places are given,
and we must count the different ways of assigning objects to places. For the sake of
simplicity, we will call configuration the result of an assignment. We will refer to
assignments and configurations using letters of the Latin alphabet, respectively
in upper and lower case.


Example A.1. If an assignment O consists in putting the numbers ranging from
1 to 6 in two different places, the corresponding configurations can be built
according to two possible situations:


1. It is not possible to assign a number to more than one position. The
assignment and the resulting configurations are said to be without repetition.
Thus o1 = {1,4} and o2 = {2,3} are configurations of O, while {1,1} and
{3,3} are not.


2. It is possible to assign a number to more than one position. The
assignment and the resulting configurations are said to be with repetition.
Therefore o1 = {1,4}, o2 = {2,3}, o3 = {1,1} and o4 = {3,3} are all
configurations of O.


The following Factorization principle holds: suppose we get an assignment O of
the objects by performing two different assignments (say O1 and O2)
independently of each other. If the possible configurations of O1 are
distinguishable from those of O2, the total number of configurations of O is
obtained by multiplying the number of configurations of O1 by that of O2.


Example A.2. 


a) If O consists in throwing a die (O1) and simultaneously tossing a coin (O2),
whether fair or not, then


• the number of configurations in O1 is 6 = #{1, 2, 3, 4, 5, 6};




in any way 


You may split the 
operation in 
many phases 


provided you can 
recognize their 
results 




Table A.1: Classification of configurations generated by combinatorial operators.

                                        Objects
Places              Distinguishable                      Indistinguishable

Distinguishable     Dispositions (with repetition)       Permutations (without repetition)
                    Dispositions (without repetition)

Indistinguishable   Combinations (with repetition)       Single selection
                    Combinations (without repetition)


• analogously, the number of configurations in O2 is 2 = #{Head, Tail}.


Therefore O has 12 = 6 · 2 = #{1 ∧ Head, 1 ∧ Tail, 2 ∧ Head, 2 ∧ Tail, ..., 6 ∧
Tail} configurations.


b) If O consists in tossing two coins, we can view the first and second toss as
O1 and O2, respectively. Since the number of configurations in both O1
and O2 is 2 = #{Head, Tail}, the total number of configurations in O is


• 4 = #{Head1 ∧ Head2, Head1 ∧ Tail2, Tail1 ∧ Head2, Tail1 ∧ Tail2} if
the coins are distinguishable, and

• 3 = #{Head ∧ Head, Head ∧ Tail, Tail ∧ Tail} if the coins are
indistinguishable.


in terms of

Assigning objects to places is the task of combinatorial operators that can be
split into k successive subtasks: the first related to the assignment of an object
to the first place, the second related to the assignment of an object to the second
place, and so on. Thus the factorization principle induces a three-entry table to
summarize the different aspects of combinatorial calculus: its cells are accessed
depending on




Table A.2: The possible dispositions of 5 objects a, b, c, d and e in 3 places.

[Table body: the 60 ordered triples of distinct letters drawn from a, b, c, d, e.]


a) the possibility of distinguishing among objects; 
b) the possibility of distinguishing among places; 
c) the independence between outcomes of the single tasks. 


We represent the cells in Table A.1, where the third entry is expressed through
different contents within the same cell.


Example A.3. Having subscribed to five on-line services (which we will
henceforth refer to as a, b, c, d and e), I expect to receive junk mail from some of
them. I open the mailbox once a week, on Sunday. I browse the messages
sequentially, in the same order they reached the mailbox, and delete the
unsolicited ones. Moreover each provider specializes in one kind of message:
one offers miraculous pills, another extraordinarily profitable investments, etc.
The sequence in which I read the messages affects how sorely my patience is
tried.

Let us consider the following situation: I know each provider will send me
one spam message a week. Connecting to the net next Sunday, in how many
configurations can I receive the first three unsolicited messages?

In this case we are considering the assignment of five objects (a, b, c, d and
e) to three places (the first, second and third spam message) without repetition.
There are 60 possible configurations, as illustrated in Table A.2. To compute
this number without listing all of them we can apply the factorization principle.
The assignment O can be viewed as the result of three operations:


O1: reception of the first message (filling the first place); this can happen in 5
ways, since I may receive the mail from any of the 5 providers, i.e. a mail
on any of their favorite topics (a further distinguishing feature);


O2: reception of the second message; this can happen in 4 ways, since one
provider has already sent his message;


O3: reception of the third message; this can happen in 3 ways, since two
providers have already sent their messages.


objects picked 
places covered 


dependence 
between the 
phases. 


Complete 
information, 
constrained 
freedom 


Objects and 
places 
distinguishable, 
repetition 
forbidden 


Complete 
information, 
complete 
freedom 




Table A.3: A different listing of the dispositions in Table A.2.

[Table body: the same 60 ordered triples as in Table A.2, listed in a different order.]


Thus the total number of configurations is 60 = 5 · 4 · 3.


In the more general case: 


Fact A.1. We call disposition without repetition a configuration of n
distinguishable objects on k distinguishable places, with n ≥ k, where each object
may fill at most one place. The number d(n, k) of these dispositions is given by:

$$d(n,k) = n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!} \tag{A.1}$$

where n! is the factorial of n ∈ N, defined as

$$n! = \begin{cases} \prod_{i=1}^{n} i & \text{if } n > 0 \\ 1 & \text{if } n = 0 \end{cases} \tag{A.2}$$

i.e. the product of the first n natural numbers, 0 excluded.


Applying the above composition rules gives the same list of dispositions, but 
sorted in the alternate order reported in Table A.3. 
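
The count in Fact A.1 can be checked by brute force. The following Python sketch is only an illustration under our own naming (`providers`, `d` and `dispositions` are hypothetical names, not taken from the text): it enumerates the dispositions of Example A.3 with the standard `itertools.permutations` function and compares their number with (A.1). The enumeration order is the lexicographic one produced by `itertools`, which need not coincide with the ordering of Table A.2 or Table A.3.

```python
from itertools import permutations
from math import factorial, perm

providers = "abcde"   # the five providers of Example A.3
k = 3                 # places: first, second and third spam message

# Every ordered triple of distinct providers is one disposition without repetition.
dispositions = list(permutations(providers, k))

def d(n, k):
    """Number of dispositions without repetition, as in (A.1)."""
    return factorial(n) // factorial(n - k)

print(len(dispositions), d(len(providers), k), perm(len(providers), k))
```

The three printed numbers agree because `math.perm(n, k)` computes the same falling product n(n − 1)...(n − k + 1).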


Example A.4. Continuing the previous example, I know each provider will send 
me spam e-mails without limit on their number. In how many configurations 
can I receive the next three unsolicited messages? 

We are still considering the assignment of five objects to three places, with 
the difference that now repetition is allowed. In this case we will not explicitly 
list all the 125 different possibilities, such as a a a, e e e, a a e and so on. 
This number indeed can be obtained again through the factorization principle, 
which in this case reads: 




Table A.4: The possible permutations of 3 objects a, b and c. 


O1: reception of the first message: this can happen in 5 ways;


O2: reception of the second message: this still can happen in 5 ways (since the
provider who sent the first message could have sent the second one, too);


O3: reception of the third message: this still can happen in 5 ways (since
providers having possibly sent one or both messages could have sent the
third one too).


Thus the total number of configurations will be 125 = 5 · 5 · 5.


In the more general case: 


Fact A.2. We call disposition with repetition a configuration of n
distinguishable objects on k distinguishable places, with n ≥ k, where each object
may fill more than one place. The number D(n, k) of these dispositions is given by:

$$D(n,k) = n^k \tag{A.3}$$
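
As a hedged check of (A.3), the 125 configurations of Example A.4 can be enumerated rather than listed by hand. The sketch below uses the standard `itertools.product`; the names `providers` and `with_repetition` are our own illustrative choices.

```python
from itertools import product

providers = "abcde"   # the five providers of Example A.4
k = 3                 # the next three unsolicited messages

# Each place is filled independently, so the same provider may appear repeatedly.
with_repetition = list(product(providers, repeat=k))

print(len(with_repetition), len(providers) ** k)
```

Triples such as a a a or a a e appear in this list, whereas they are excluded from the dispositions without repetition of Fact A.1.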


Example A.5. With the same notation as before, I know which three
providers will send me one spam message in the next week, while the
remaining two will suspend their service. At the end of next week,
in how many configurations can I have received the three unsolicited messages?

In this case we are dealing with configurations where three prefixed providers
(say a, b and c) appear in any order. From another perspective, these providers
are indistinguishable, since I have no way of deciding their selection by
myself. Since the number of actually available providers is 3, this amounts to
considering all the d(3,3) = 6 dispositions without repetition of three objects
in three places. Their complete list is given in Table A.4.


Fact A.3. We call permutation a configuration of k indistinguishable objects on
k distinguishable places, where each object may fill only one place. The number
a(k) of these permutations is given by:

$$a(k) = d(k,k) = k! \tag{A.4}$$


Objects and 
places 
distinguishable, 
repetition allowed 


Information only 
on the places, 


freedom 
constrained 


Objects 
indistinguishable, 
places 
distinguishable, 
repetition 
forbidden 


Information only 
on the objects, 


freedom 
constrained 


Objects 
distinguishable, 
places 
indistinguishable, 
repetition 
forbidden 




Table A.5: Properties of the binomial coefficient. 


Example A.6. Now I know each provider will send me one spam e-mail a week.
Since I have set my browser to sort the messages by subject, the
sequence is now locked. Connecting to the net at the end of the week, in how many
configurations can I receive the first three unsolicited messages?

Let us come back to the assignment O before the installation of this browser
facility. The d(n, k) configurations may be obtained by splitting O into the
following two operations:


O1: selection of k from among the n subscribed providers, regardless of their
order. This means that neither I nor anyone acting in my stead has any way of
deciding, at each session, the order in which the messages
will be read. We will refer to the number of these choices as the number
c(n, k) of combinations of n elements in k positions;


O2: permutation of the k selected messages in the k slots of the browser. This
can be done, as described above, in a(k) ways.
Therefore d(n, k) = c(n, k) a(k), hence the number of possible configurations
is c(5, 3) = d(5, 3)/a(3) = 60/6 = 10.


Fact A.4. We call combination without repetition a configuration of n
distinguishable objects on k indistinguishable places, where each object may fill
only one place. The number c(n, k) of these combinations is given by:

$$c(n,k) = \frac{d(n,k)}{a(k)} = \frac{n!}{k!\,(n-k)!} = \binom{n}{k} \tag{A.5}$$


The expression $\binom{n}{k}$, known as the binomial coefficient, appears in the
Newton formula of binomial expansion

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} \tag{A.6}$$

From the definition of $\binom{n}{k}$ and the Newton formula we can derive the
properties reported in Table A.5. Algorithm A.1 generates the list of combinations




Algorithm A.1 Generating the list of combinations without repetition of n
objects in k places [Lehmer, 1964].
1. arrange the n elements in any order;


2. build a list composed of n − k 0’s followed by k 1’s;


3. select, with reference to the order introduced in 1, the elements
corresponding to 1’s in the list, and output them;


4. find the rightmost position x_s in the list occupied by a 0 and followed by
a 1; if no such position exists, STOP;


5. swap the positions x_s and x_{s+1}; rearrange the positions to the right of
x_{s+1} in such a way that all 0’s precede 1’s; GO TO 3.


Table A.6: The combinations of a, b, c, d, e in 3 places generated by Algorithm A.1.


abc | abd | acd
bcd | abe | ace
bce | ade | bde
cde


without repetition of n objects in k places. Table A.6 shows its output for n = 5,
k = 3.
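
The steps of Algorithm A.1 can be sketched in Python as follows. This is one literal reading of the five steps, assuming 0 ≤ k ≤ n; the function name `combinations_without_repetition` and the variable `flags` are our own. The order in which the combinations come out depends on how the elements are arranged in step 1 and on how the list is read, so it should not be expected to reproduce the row order of Table A.6.

```python
def combinations_without_repetition(elements, k):
    """Yield the k-subsets of `elements` following the steps of Algorithm A.1.

    Assumes 0 <= k <= len(elements).
    """
    n = len(elements)                      # step 1: keep the given order
    flags = [0] * (n - k) + [1] * k        # step 2: n-k zeros, then k ones
    while True:
        # step 3: output the elements whose flag is 1
        yield tuple(e for e, f in zip(elements, flags) if f)
        # step 4: rightmost index s holding a 0 immediately followed by a 1
        s = next((i for i in range(n - 2, -1, -1)
                  if flags[i] == 0 and flags[i + 1] == 1), None)
        if s is None:
            return                         # no such position: STOP
        # step 5: swap positions s and s+1, then put zeros before ones
        # in everything to the right of position s+1
        flags[s], flags[s + 1] = flags[s + 1], flags[s]
        flags[s + 2:] = sorted(flags[s + 2:])

result = list(combinations_without_repetition("abcde", 3))
print(len(result))
```

Starting from the list 0 0 1 1 1, this reading outputs (c, d, e) first and ends with (a, b, c), after which no 0 is followed by a 1 and the procedure stops.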


Example A.7. In this case I know each provider will send me spam messages
without limit on their number, and I installed the above browser sorting facility.
Connecting to the net, in how many configurations can I receive the next three
unsolicited messages?

In this case we find it useful to describe the configurations in a slightly
different way, writing the symbols of the five providers in a fixed order (say, a b c d
e) and placing on the right of a symbol one or more copies of a fixed character
(say, an asterisk *) to identify one or more messages from the corresponding
provider. For instance the string

a * b c * * d e

describes the configuration corresponding to one message sent from the first
provider and two messages sent from the third one.
Thus selecting a particular configuration amounts to selecting 3 positions
from among the last seven listed above (as the first position must always be
filled with a letter). This means that we have $\binom{7}{3}$ = 35 different configurations.


Information only 
on the objects, 
complete 
freedom 


Objects 
distinguishable, 
places 
indistinguishable, 
repetition allowed 


Lack of 
information, i.e. 
objects and 
places 
indistinguishable 


You get 
probability only if 
you count 
correctly 




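Table A.7: Combinatorial coefficients.

                                        Objects
Places              Distinguishable                      Indistinguishable

Distinguishable     D(n, k) (with repetition)            a(k)
                    d(n, k) (without repetition)

Indistinguishable   C(n, k) (with repetition)            1
                    c(n, k) (without repetition)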


Fact A.5. We call combination with repetition a configuration of n
distinguishable objects on k indistinguishable places, where each object may fill
more than one place. The number C(n, k) of these combinations is given by:

$$C(n,k) = c(n+k-1,\,k) \tag{A.7}$$
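
The counting argument of Example A.7 can be illustrated with a short Python sketch. It enumerates the combinations with repetition through the standard `itertools.combinations_with_replacement` and renders each of them as a letters-and-asterisks string; the helper name `stars` and the variable names are our own assumptions.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb

providers = "abcde"
k = 3

# Unordered selections of k senders, repetition allowed.
configs = list(combinations_with_replacement(providers, k))

def stars(config, symbols=providers):
    """Render a configuration as in Example A.7: each symbol followed by
    one asterisk per message received from that provider."""
    counts = Counter(config)
    return " ".join(" ".join([s] + ["*"] * counts[s]) for s in symbols)

print(len(configs), comb(len(providers) + k - 1, k))
print(stars(("a", "c", "c")))
```

Each rendered string has eight symbols beginning with a letter, so a configuration is fixed by choosing which 3 of the last 7 positions hold an asterisk, in agreement with (A.7).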


Example A.8. I know each provider will send me one or more spam messages.
I installed a spam filtering facility which automatically removes the junk
messages. It just notifies me that it removed 3 messages. In this case I may consider
one sole configuration, as I’m able to distinguish neither the sender nor the
sending sequence.


In light of the above facts, we can restate Table A.1 in terms of the involved 
combinatorial indices, as illustrated in Table A.7. 


A.2 Taking samples probabilities 


In the equiprobable model (see Sec. 1.2.3), the probability of a special event A
is given by the ratio between the number $n_A$ of elementary events constituting
A (say the number of possible events verifying A) and the total number $n_{tot}$ of
the elementary events. Namely:

$$P(A) = \frac{n_A}{n_{tot}} \tag{A.8}$$


Counting these two numbers may be a delicate job. Some help comes
if we can describe the elementary events as configurations within the taxonomy
in Table A.1. This is the case for the basic model of Bernoullian sampling, as in
the following example.


Sampling with replacement

Example A.9. Given an urn with N equiprobable balls, K red and N − K white, I
consider the experiment of drawing n balls with replacement, i.e. I draw a ball
(each with the same probability by definition), observe its color and then put
it back in the urn. I wonder about the probability that k of the observed balls
are red. Now, each set of n observations has the same probability of occurring.
Hence, I can compute the wanted probability as the above ratio, taking any
observation set as an elementary event. Since it constitutes a disposition with
repetition of N objects on n places, the possible configurations number
D(N, n). Note that what I’m actually able to recognize is the color of a ball.
Nevertheless, since I can draw one or another ball, every ball is
distinguishable by me. The same holds for the places.

This is for the denominator. As for the numerator, I consider only the
dispositions that satisfy the property I’m focusing on, i.e. the dispositions
containing k observations of a red ball. I can split the assignment generating
these dispositions into three operations:


• O1: I decide the places, within the n observations, where the k red balls
will be located. I can do it in c(n, k) ways.


• O2: I fill up the selected places with red balls. I can do it in D(K, k)
different ways.


• O3: I fill up the remaining places with the white balls. I can do it in
D(N − K, n − k) different ways.


Summing up, the numerator equals c(n, k) D(K, k) D(N − K, n − k) and the
whole ratio (A.8) reads:

$$P(k; n, K, N) = \frac{c(n,k)\,D(K,k)\,D(N-K,n-k)}{D(N,n)} = \binom{n}{k} \left(\frac{K}{N}\right)^{k} \left(1 - \frac{K}{N}\right)^{n-k} \tag{A.9}$$

which, for p = K/N, reads as in (1.21), where the same expression has been
constructed through incremental modeling.
Sampling without replacement

The above may represent a model for solving your key question: “how many
junk messages will I receive within the next n arrivals?”. Actually you rely
on a long file of N messages, K of which proved junk. Under the hypothesis that
nothing will change in the near future, you may regard the next n messages
as a sample from the N previous ones. The fact is that you will not reinsert


Places for free 




in the heap the messages already extracted; rather, once read, you throw them
in the basket. So, to be more accurate, you must consider dispositions without
repetition. The sole drawback is that the resulting formula takes longer to
compute. Indeed, the above ratio now reads:

$$P(k; n, K, N) = \frac{\binom{n}{k}\,\frac{K!}{(K-k)!}\,\frac{(N-K)!}{(N-K-n+k)!}}{\frac{N!}{(N-n)!}} = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \tag{A.10}$$

that is the probability distribution according to the hypergeometric model (see
(1.22)).


As a matter of fact, with N and K going to infinity while the ratio K/N stays
fixed, the two models (binomial and hypergeometric) coincide.
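
The relation between the two models can be explored numerically. The following Python sketch is only an illustration: the function names `binomial_pmf` and `hypergeometric_pmf` are ours, and the numerical values of N, K and n are arbitrary assumptions. It evaluates (A.9) and (A.10) for the same k and n while N and K grow with K/N kept at 0.4.

```python
from math import comb

def binomial_pmf(k, n, K, N):
    """Sampling with replacement, as in (A.9) with p = K/N."""
    p = K / N
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(k, n, K, N):
    """Sampling without replacement, as in (A.10)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

n, k = 5, 2
for N in (20, 200, 2000, 200000):      # K/N is held at 0.4 throughout
    K = 2 * N // 5
    print(N, binomial_pmf(k, n, K, N), hypergeometric_pmf(k, n, K, N))
```

The binomial column does not change with N, while the hypergeometric one approaches it as the heap of messages grows.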

Summing up, working with combinatorial operators we start by default from the
outcome space made up of the dispositions with repetition of n objects on k
places. Then, in some cases we can move to less crowded outcome spaces.
Thus, we may come to the dispositions without repetition if the model forbids
the assignment of an object to more than one place. Further, we can move to
combinations of the objects if their place is inessential, in the sense that the
elementary events in both the numerator and the denominator are replicated k!
times by their permutation on the places.


Example A.10. (Winning in the national lottery) What is the probability of 
matching three numbers in the Italian national lottery (the game consists in 
betting on subsets of 5 numbers picked at random without replacement from 
1 to 90)? In this case the outcome space is the set of all dispositions of 90 
numbers on 5 places without repetition, thus consisting of d(90,5) elementary 
events. The number of favorable events coincides with the product of c(5, 3) 
(the possible ways we can choose the positions in which to place the matched 
numbers) times d(3,3) (the possible ways we can assign matched numbers to 
positions) times d(87,2) (the possible ways we can assign unmatched numbers 
to the remaining positions). Thus 


$$P(\text{three matches}) = \frac{c(5,3)\,d(3,3)\,d(87,2)}{d(90,5)} = \frac{\frac{5!}{3!\,2!}\,3!\,\frac{87!}{85!}}{\frac{90!}{85!}} = \frac{1}{11748} = 0.00008512 \tag{A.11}$$

We can obtain the same result by collapsing all permutations of the picked
numbers into a unique configuration, thus considering all the c(90, 5) =
$\binom{90}{5}$ combinations without repetition as the number of equiprobable
events. Then we count the number of favorable events as the number of
combinations of the 90 − 3 spare numbers in the two unlucky places, i.e.
c(87, 2) = $\binom{87}{2}$, since we have just one way of arranging the matched
numbers on the selected places. Therefore the probability we want is

$$P(\text{three matches}) = \frac{\binom{87}{2}}{\binom{90}{5}} = \frac{\frac{87!}{2!\,85!}}{\frac{90!}{5!\,85!}} = \frac{1}{11748} \approx 0.00008512 \tag{A.12}$$


It is worthwhile to note that using the combinatorial indices allowing repetition
would not have led to the same results, nor to a good approximation of them.
Indeed, the first mode of reasoning would give as result

$$P(\text{three matches}) = \frac{c(5,3)\,D(3,3)\,D(87,2)}{D(90,5)} = \frac{\binom{5}{3}\,3^3\,87^2}{90^5} = \frac{841}{2430000} = 0.0003461 \tag{A.13}$$

Analogously, the second mode would lead to

$$P(\text{three matches}) = \frac{C(87,2)}{C(90,5)} = 0.00006974 \tag{A.14}$$


We come to the same model if the N messages have been univocally sorted
according to some automatic index (like the one computed by the spam filter
considered in Example A.8).
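
The four ratios of Example A.10 can be recomputed exactly. The sketch below is an illustrative check using Python's `fractions.Fraction` together with `math.comb` and `math.perm`; the variable names are our own. It confirms that dispositions and combinations without repetition give the same probability, while the indices allowing repetition give different values.

```python
from fractions import Fraction
from math import comb, perm

# Without repetition: dispositions (A.11) and combinations (A.12).
p_dispositions = Fraction(comb(5, 3) * perm(3, 3) * perm(87, 2), perm(90, 5))
p_combinations = Fraction(comb(87, 2), comb(90, 5))

# With repetition: D(n, k) = n**k for (A.13), C(n, k) = c(n + k - 1, k) for (A.14).
p_disp_rep = Fraction(comb(5, 3) * 3**3 * 87**2, 90**5)
p_comb_rep = Fraction(comb(87 + 2 - 1, 2), comb(90 + 5 - 1, 5))

for name, p in [("A.11", p_dispositions), ("A.12", p_combinations),
                ("A.13", p_disp_rep), ("A.14", p_comb_rep)]:
    print(name, p, float(p))
```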


The following example definitely requires considering dispositions and not
combinations in order to get the probability.


Places crucial

Example A.11. (The birthday problem) What is the probability that no one in a
class of 200 students will celebrate his birthday on the same day as someone else
in the class? Let us assume non-leap years and birthdays uniformly distributed
over the year. In this case the sample space is composed of D(365, 200) events
(the possible ways to assign 365 days of the year to 200 birthdays). Analogously,
there are d(365, 200) favorable events, which correspond to the ways we
have of listing 200 birthdays without repeating any entry. Thus the requested
probability is

$$P(\text{no common birthday}) = \frac{d(365,200)}{D(365,200)} = \frac{365!/165!}{365^{200}} = 1.61 \times 10^{-30} \tag{A.15}$$

This event becomes more probable in a class of 30 students, as

$$P(\text{no common birthday}) = \frac{d(365,30)}{D(365,30)} = 0.2937 \tag{A.16}$$
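
The two values of Example A.11 can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the example (365 equally likely days); the function name `p_no_common_birthday` is ours. Python integers have arbitrary precision, so the ratio d(365, n)/D(365, n) is formed exactly before being converted to a floating-point number.

```python
from math import perm

def p_no_common_birthday(students, days=365):
    """Ratio d(days, students) / D(days, students) of (A.15) and (A.16)."""
    return perm(days, students) / days**students

print(p_no_common_birthday(200))
print(p_no_common_birthday(30))
```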




Coming back to ensembles

In the first computations in Chapter 1, those concerning the probability
Definition 1.5, the set of elementary equiprobable events is constituted by the
permutations of the elements of an observed string of data.


Example A.12 (Playing with dice). Two dice are thrown. If t1 and t2 denote
the score we get respectively from the first and second die, for an unknown
t ∈ {0, ..., 12}, define


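$$a = \begin{cases} 1 & \text{if } \ldots \\ 0 & \text{otherwise} \end{cases} \tag{A.17}$$

$$b = \begin{cases} 1 & \text{if } t_1 + t_2 > t \\ 0 & \text{otherwise} \end{cases} \tag{A.18}$$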


Consider a sequence (o1, o2, ...), where for each k ∈ N the pair
$(o_{2k+1}, o_{2k+2})$ refers to the same throw. This sequence is ordered, in that
for each k ∈ N we necessarily have

$$o_{2k+1} \le o_{2k+2} \tag{A.19}$$

If the sequence (0,0,0,1,0,1) is observed, what is the probability that the next
observation is o7 = 1? Since the sequence is sorted as explained above, o7 = 1
implies o8 = 1. The augmented symmetry ensemble (see Definition 1.5) is made
up of the 24 permutations of the 4 pairs {00,01,01,11}. The set of favorable
events is constituted by the 6 permutations of the first 3 elements followed by
the pair 11, as requested by the event specification. This gives a probability:

$$P(o_7 = 1) = \frac{6}{24} = \frac{1}{4} \tag{A.20}$$


Note that the first three pairs univocally correspond to pairs of dice throws
{00, 01, 01} coded through (A.17) on each throw. Thus, opening the observed
results, under the assumption that the dice are equal, we may conclude that we have
observed a string of 2 ones and 4 zeros. Moreover the hypothesized original pair
11 translates into the fact that the seventh throw is coded 1 and the eighth is a
don’t care, since it may be either 0 or 1 without changing the result of o8. After
this analysis we learn that the probability of the result 11 of the hypothesized
pair o7, o8 is much higher, since it is


P(o =1)=? (A.21) 


Things are worse if we forget the sequence of the observed bits, but try to 
reconstruct the symmetry ensemble through local sorting operation producing 
admissible strings, though not resulting from the permutations of what we have 
observed. Namely we look for a symmetry ensemble made up of every possible 
sequence containing four 0’s and four 1’s and satisfying the above inequality 
(A.19) . There are three possibilities for assembling this sequence: 


1. considering four times the pair (0,1). As a permutation of the 0’s does 
not change the sequence (idem for a permutation of the 1’s), there are 
a(4)a(4) = 4!4! = 576 such sequences; 




2. considering sequences consisting of one pair (0,0), one pair (1,1) and two
pairs (0,1). As (1,1) can be placed in 4 different positions in the sequence,
the two (0,1)’s can be placed in c(3,2) different positions and, as above,
0’s and 1’s can be exchanged, there are 4 c(3,2) a(4) a(4) =
4 $\binom{3}{2}$ 4! 4! = 6912 such sequences;


3. considering sequences containing two pairs (0,0) and two pairs (1,1). As
the (0,0)’s can be placed in c(4,2) positions in the sequence and, as above,
0’s and 1’s can be exchanged, there are c(4,2) a(4) a(4) =
$\binom{4}{2}$ 4! 4! = 3456 such sequences.


Summing up, there are 576 + 6912 + 3456 = 10944 possible sequences. The
favorable ones are found in the subset of the sequences in point 2 where the
pair (1,1) appears at the end of the sequence. Thus they number
c(3,2) a(4) a(4) = 1728, and the probability we compute is

$$P(o_7 = 1) = \frac{1728}{10944} = \frac{3}{19} \tag{A.22}$$

that is definitely less than both previous values computed for the event.


Three tests 


• Compute the probability that, tossing a fair die four times, the outcomes
will appear in increasing order.


• Suppose you have a program that randomly generates 6-digit natural numbers.
Compute the probability of drawing a number with exactly 3 identical digits
and the probability of drawing a number with at least 3 identical digits.


• I am playing low-stakes poker with three friends of mine. While the
dealer is dealing the initial five cards, I wonder what the probability is of
getting a full house, three of a kind, or four of a kind. Could you help me?


B- Random variables 


Given a probability space (Ω, Σ, P), a random variable X is a bijective function
from the elementary events of Ω to an enumerable subset of R; its range may
be expanded to the whole R to account for X continuous (see Definition 1.12).
In this appendix you find strict definitions and essential tools for dealing with 
such variables. We revise both the functions associating probabilities to the 
values a random variable may assume and some synthetic parameters of this 
association, such as mean and standard deviation. Then we list the most com- 
mon random variables we meet for modeling a phenomenon (the remaining ones 
generally represent functions of them). To give the list an operational value, to 
each variable we devote a sheet containing: i) the analytical description of its 
distribution law, ii) its synthetic parameters, iii) the sampling mechanism for 
generating values of the variable, and iv) tools for estimating the law parameters 
from a sample of values observed on the variable. 

This covers the basics. In two further sections we will synthetically describe 
tools for: a) jointly dealing with more than one random variable; and b) man- 
aging new random variables computed through functions still having random 
variables in their arguments. 

We use many examples to familiarize the reader with these notions, while
a small number of exercises are left to the reader to check the comprehension of
the most elementary concepts. As we are discussing the actual management of
data, we generally conclude the examples with numerical results. Since real
numbers are assumed in this book to be a suitable approximation for representing
what are actually rational ones, in turn we report these numbers with a certain
approximation. Though the proper number of significant digits depends on the
operational framework, we fix it to 4 throughout the book. We also omit the units
of measurement of the involved variables as long as clarity does not suffer.


B.1 Distribution laws 


Definition B.1. The cumulative distribution function (c.d.f.), or simply the
distribution function of a random variable X, is defined as:

$$F_X(x) = P(X \le x) \tag{B.1}$$




Cumulative 
distribution 
function 




Fig. B.1: A discrete distribution function Fx. Bold lines show the plot of Fx, while 
the dashed segment denotes the probability associated to event X = 2. 


Check: A c.d.f. is a function g : R → [0,1] such that:

1. $\lim_{x \to -\infty} g(x) = 0$; $\lim_{x \to +\infty} g(x) = 1$;

2. $x_1 < x_2 \Rightarrow g(x_1) \le g(x_2)$, i.e. g is a non decreasing monotone function;
3. $\lim_{h \to 0^+} g(x + h) = g(x)$, i.e. g is continuous from the right.


Discrete/continuous random variables

Definition B.2. Denoting by $D_X$ the set of values X may assume (as a result of a
mapping from the outcome space of an underlying probabilistic experiment), we
qualify X as discrete if $D_X$ is enumerable, continuous otherwise, with a special
distinction for mixed variables, where $D_X$ may be partitioned into both enumerable
and non enumerable subsets. As mentioned in Definition 2.1, $D_X$ represents the
sample space of the variable, i.e. the outcome space of the experiment consisting
in drawing X.


Property (3) in the check of Def. B.1 is especially relevant when we deal with
discrete random variables. If you consider, for example, the distribution function
drawn in Fig. B.1, property (3) means that at x = 2 it takes value 0.4 (but value
0.15 immediately before x = 2), in accordance with the definition
$F_X(2) = P(X \le 2)$.

The non cumulative association of probabilities to values of a random variable
has a different meaning depending on whether it refers to a discrete or a
continuous variable. Nevertheless we will denote this association with the same
symbol f, possibly speaking of density function in both cases; or alternatively
giving this name to f referred to a continuous variable and calling probability
function an f referred to a discrete variable (see footnote 1). This will allow us
to unify the notation of many sentences referring to both kinds of functions.
Therefore:


1 In the last case this function has been denoted also by P elsewhere.




Definition B.3. The probability function (p.f.) of a discrete random variable X
is defined as:

$$f_X(x) = \begin{cases} P(X = x) & \text{if } x \in D_X \\ 0 & \text{otherwise} \end{cases} \tag{B.2}$$


Check: A p.f. is a function g : R → [0,1] such that for enumerable I:
1. $g(x_i) \ge 0 \;\; \forall i \in I$;
2. $g(x) = 0 \;\; \forall x \ne x_i \;\; \forall i \in I$ (see footnote 2);
3. $\sum_{i \in I} g(x_i) = 1$.


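Referring to Fig. B.1, $f_X(2) = F_X(2) - F_X(1) = 0.25$, which is represented
by the height of the dashed step. If we list the elements in $D_X$ in increasing
order $\{x_1, \ldots, x_i, \ldots\}$ (see footnote 3), the relationships between
distribution and probability functions of X are:

$$f_X(x_i) = F_X(x_i) - F_X(x_{i-1}) \quad \forall i \in I \tag{B.3}$$

$$f_X(x) = 0 \quad \forall x \ne x_i \;\; \forall i \in I \tag{B.4}$$

$$F_X(x) = \begin{cases} 0 & \text{if } x < x_1 \\ \sum_{r=1}^{i} f_X(x_r) & \text{if } x_i \le x < x_{i+1}, \text{ if } i+1 \in I \\ 1 & \text{otherwise} \end{cases} \tag{B.5}$$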


To facilitate the perception of the algorithmic counterpart of the definitions,
in the rest of this appendix we will assume by default the mentioned indexing
monotonicity and will denote by ν the (possibly infinite) cardinality of I,
so that $D_X = \{x_1, \ldots, x_\nu\}$. For any x we denote by $i_x$ the index i such that
$x_i \le x < x_{i+1}$. Note that we use a particular typeface to distinguish this indexing
from the one enumerating values in a generic set of specifications $\{x_1, \ldots, x_m\}$.
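
The algorithmic counterpart of (B.3)-(B.5) can be sketched in a few lines of Python. The distribution used below is hypothetical: only the values 0.15 and 0.4 of the c.d.f. at x = 1 and x = 2 are quoted in the text for Fig. B.1, and the remaining probabilities, like the names `values`, `probs`, `pf` and `cdf`, are our own assumptions. The index $i_x$ is found by binary search with the standard `bisect_right` function.

```python
from bisect import bisect_right
from itertools import accumulate

values = [1, 2, 3, 4]                # D_X listed in increasing order
probs = [0.15, 0.25, 0.35, 0.25]     # hypothetical f_X(x_i); they sum to 1
cumulative = list(accumulate(probs)) # partial sums, as in (B.5)

def pf(x):
    """Probability function: f_X(x) for x in D_X, 0 elsewhere, as in (B.2)."""
    return dict(zip(values, probs)).get(x, 0.0)

def cdf(x):
    """Distribution function (B.5): sum of f_X(x_r) over all x_r <= x."""
    i = bisect_right(values, x)      # number of values not exceeding x
    return 0.0 if i == 0 else cumulative[i - 1]

print(cdf(1.999), cdf(2), cdf(2) - cdf(1), pf(2))
```

The jump of `cdf` at x = 2 equals `pf(2)`, which is relation (B.3), and the value at x = 2 already includes that jump, which is the continuity from the right required of a c.d.f.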


Definition B.4. The density function (d.f.) of a continuous random variable X
is the derivative of its c.d.f. (see footnote 4):

$$f_X(x) = \frac{dF_X(x)}{dx} \tag{B.6}$$


Check: A d.f. is a function g : R → R⁺ such that:
1. $\int_{-\infty}^{+\infty} g(x)\,dx = 1$


2 Using the symbol ∀ to denote the universal quantifier “for each”.
3 Hence with $x_i < x_{i+1}$ ∀ i.


4 With the caveats mentioned after Definition 1.15.




322 Random variables 


Fig. B.2: (a) Geometric meaning of the distribution function. Line: plot of a d.f. $f_X$.
Gray area: probability of the event $X \le x$. (b) Line: plot of the corresponding c.d.f.
$F_X$; the height of the gray bar has the same value as the gray area in (a).


Given the density function of a continuous random variable X it is always 
possible to compute its distribution function by integrating (B.6): 


$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \tag{B.7}$$

Equation (B.7) also suggests the geometric meaning of the cumulative distribution
function: it represents the area subtended by the density function between $-\infty$
and $x$, as shown in Fig. B.2.

We cannot always represent an experiment by a random variable which has
a unique behavior, either continuous or discrete. Consider for instance the
random variable expressing the idle period of a car subjected both to systematic
maintenance lasting a fixed amount of time $\tau$ and to shorter repairs of time
lengths uniformly spanning the interval $(0, \tau/2)$. A representative distribution
function of these idle times, with some incongruence in the scale, is shown
in Fig. B.3. The density function can be evaluated as the derivative of the
cumulative distribution where it exists, while in the mass points it is computed
as $f_X(x) = P(X = x)$. We cannot represent the two functions on the same scale,
since either the function in the continuous part goes to zero and the function
in $\tau$ has a finite non-null value, or the former has a finite non-null value and
the latter goes to infinity. In this appendix we will make explicit reference to
discrete and continuous random variables. Extensions to the mixed ones are
instead left to the reader.


Definition B.5. For $q \in [0,1]$ we define as q quantile of a random variable X,
denoted by $\xi_q$, the smallest number $\xi$ such that $F_X(\xi) \ge q$. If q is expressed as
a percentage, it is called q percentile.


Distribution laws 323 


Fig. B.3: A naive representation of a mixed distribution function.


Remark B.1. In the case of a continuous random variable the above condition simplifies
to: $\xi_q$ is such that $F_X(\xi_q) = q$.
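For a discrete variable, the quantile of Definition B.5 can be found by scanning the running sums of the p.f. The sketch below assumes a finite, increasingly sorted support; the function name and the sample distribution (the one of Example B.4) are chosen only for illustration.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

def discrete_quantile(q, support, pf):
    """Smallest support point x with F_X(x) >= q (support sorted increasingly)."""
    cdf = list(accumulate(pf))
    # bisect_left returns the first index whose running sum is >= q;
    # the min() guards against round-off in the last running sum
    idx = min(bisect_left(cdf, q), len(support) - 1)
    return support[idx]

support = [1, 2, 3, 4]
pf = [Fraction(1, 5), Fraction(3, 20), Fraction(1, 4), Fraction(2, 5)]
median = discrete_quantile(Fraction(1, 2), support, pf)
```

Here the c.d.f. jumps from 7/20 to 3/5 at x = 3, so the 0.5 quantile is 3 even though no point has c.d.f. exactly 0.5.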


Example B.1. 


• In the Italian population the 95 percentile of the S-cholesterol is 200 milligrams
per deciliter, which means that 0.9500 is the probability that an
Italian person has in his blood a S-cholesterol rate less than or equal to the
above value.

• As is well known from the FAO reports, the 0.8000 quantile of the random
variable “wealth of a world inhabitant” is a value r such that the wealth
of the people above this threshold cumulates over 70% of the world riches.
This means that 20% of humans own over 70% of the riches.


The main way for computing synthetic parameters of a random variable 
passes through the expected value operator. Namely 


Definition B.6. The expected value E[X] of a random variable X is defined as:
• $E[X] = \sum_{i=1}^{\nu} x_i f_X(x_i)$, for X discrete;
• $E[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx$, for X continuous;
• $E[X] = \int_{0}^{+\infty} (1 - F_X(x))\,dx - \int_{-\infty}^{0} F_X(x)\,dx$, for any X.




324 Random variables 


Fig. B.4: Relation between the expectation and c.d.f. of a random variable. Line: plot 
of the c.d.f. Difference between the light and dark gray area: expected value. 


Remark B.2. The last expression suggests the nice representation of E[X] as 
the difference between the shaded areas in Fig. B.4. 


More generally, we can evaluate the expected value of a function $g : \mathbb{R} \to \mathbb{R}$ of
a random variable X as follows:

• $E[g(X)] = \sum_{i=1}^{\nu} g(x_i) f_X(x_i)$, for X discrete;
• $E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x)\,dx$, for X continuous.

From Definition B.6 we can easily see that the expectation is a linear function,
therefore:

$$E[aX + bY + c] = aE[X] + bE[Y] + c \quad \forall a, b, c \in \mathbb{R},\ \forall X, Y \text{ random variables} \tag{B.8}$$


Example B.2. The probability of my train arriving late is p = 0.05. Consider 
the random variable X assuming the value 1 if I, using the train, will arrive late 
and 0 otherwise. Its expected value will be 


E[X] = 1p + 0(1 — p) = p = 0.05 (B.9) 


Next week I will take the train every workday. Let Y be the random variable
counting the number of times I will be late. As its value can be expressed
in terms of the sum of random variables $X_{mon}$, $X_{tue}$, $X_{wed}$, $X_{thu}$ and $X_{fri}$
accounting for the daily delays, each ruled by (B.9), we have

$$E[Y] = E[X_{mon} + X_{tue} + X_{wed} + X_{thu} + X_{fri}] = E[X_{mon}] + E[X_{tue}] + E[X_{wed}] + E[X_{thu}] + E[X_{fri}] = 5p = 0.25 \tag{B.10}$$


Going through the delay causes, suppose that each weekday my transportation
company uses a different train chosen from among 5 and that one of them is
older and takes more time to get to its destination (thus certainly arriving late).
If at the beginning of each week trains are randomly assigned to specific days,
on each day the probability of having a delay will be q = 1/5. If I am late my job
is diverted to another guy for the rest of the week. If we consider the random
variable Z counting the number of days I work in a week, we have


E|Z] = i sees (B.11) 


A family of functions g used to account for the shape of a distribution law of X 
is constituted by the powers of x. Their expected values are called moments of 
the random variable. If the origin of the X axis is centered on its average we 
speak of central moments. Namely: 


Definition B.7. The (regular) moment and central moment of order r of a
random variable X are defined respectively as:

$$\mu_r(X) = E[X^r] \tag{B.12}$$

$$\bar{\mu}_r(X) = E[(X - E[X])^r] \tag{B.13}$$

The moment of order 1 of X coincides with its mean value E[X] (elsewhere
denoted $\mu_X$); the central moment of order 2 with its variance V[X] (usually
denoted $\sigma_X^2$).


Definition B.8. The variance of a random variable X is defined as:
• $V[X] = \sum_{i=1}^{\nu} (x_i - E[X])^2 f_X(x_i)$, for X discrete;
• $V[X] = \int_{-\infty}^{+\infty} (x - E[X])^2 f_X(x)\,dx$, for X continuous.

A very useful relation between the variance and non-central moments of a random
variable is:

$$V[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2 \tag{B.14}$$

The variance is a non-linear operator for which

$$V[aX + b] = a^2 V[X] \quad \forall a, b \in \mathbb{R} \tag{B.15}$$

while V[X + Y] = V[X] + V[Y] if X and Y are independent, as we will see in
Sec. B.4.
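Relations (B.14) and (B.15) can be checked exactly on a small discrete variable. The sketch below uses a fair six-faced die as a test case (the helper names and the choice a = 3, b = −2 are illustrative assumptions) and exact fractions to avoid round-off.

```python
from fractions import Fraction

def expectation(support, pf, g=lambda x: x):
    """E[g(X)] for a discrete variable given its support and p.f."""
    return sum(g(x) * p for x, p in zip(support, pf))

def variance(support, pf):
    """Central moment of order 2, computed from its definition."""
    mean = expectation(support, pf)
    return expectation(support, pf, lambda x: (x - mean) ** 2)

die = list(range(1, 7))
pf = [Fraction(1, 6)] * 6

mean = expectation(die, pf)
var = variance(die, pf)
second_moment = expectation(die, pf, lambda x: x * x)

# variance of the affine transform aX + b, with the same probabilities
a, b = 3, -2
var_affine = variance([a * x + b for x in die], pf)
```

Both identities hold exactly here: `second_moment - mean**2` equals `var`, and `var_affine` equals `a*a*var`, with the shift b playing no role.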




326 Random variables 


Example B.3. Consider the two random variables X and Y representing in liters 
what is taken and what is left in a tank of 10 liters of water, in the case where 
we are allowed to draw a random quantity that may uniformly range from 0 to 
10 liters. Thus both X and Y are uniform variables in [0,10]. From page 342
we learn that their mean equals 5 so that E[X + Y] = 10, as expected. But
the variables are not independent, as X = 10 − Y. Indeed the sum is constant,
so that its variance equals 0, despite the fact that each variable has variance 100/12
(same page).
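The figures quoted in this example follow directly from (B.15). The short derivation below is only a worked restatement, using nothing beyond Y = 10 − X and the variance 100/12 quoted above:

```latex
\begin{align*}
V[Y] &= V[10 - X] = (-1)^2\, V[X] = \frac{100}{12}, \\
V[X + Y] &= V[10] = 0 \neq \frac{200}{12} = V[X] + V[Y].
\end{align*}
```

The additivity of variances therefore fails here precisely because the two variables are dependent.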


Definition B.9. The standard deviation $\sigma_X$ of a random variable X is the square
root of its variance:

$$\sigma_X = \sqrt{V[X]} \tag{B.16}$$

The main benefit of the standard deviation with respect to the variance V[X] is
that the former has the same dimension as the random variable.


B.2 Computing a sample of i.i.d. random variables 


A sample of independent identically distributed (i.i.d.) random variables (see
Definition 2.1) constitutes a set of values whose frequencies converge, with the
size of the set, to their probabilities as specified by a c.d.f. $F_X$ and such that the
user cannot predict the value x of one element of the set from the knowledge of
the others with a probability of success significantly greater than the probability
P(X = x) coming from $F_X$. Below we provide algorithms for computing
i.i.d. random samples. This operation is usually called simulating a random
variable, whereas we prefer to call generation the computation and simulation a
possible task fulfilled through it. These data satisfy the above conditions under the
sole hypothesis that the generating algorithms remain hidden from their end user.


B.2.1 Samples from a uniform continuous random variable in [0,1]


A very simple way of generating a variable X uniform in [0,1] is represented by
the procedure described in Algorithm B.1.

Algorithm B.1 Generation of a continuous uniform random variable in [0, 1].
set: m = sample size; $n = 2^{31} - 1$; $a = 7^5$; $x_0$ = local time in seconds from the
date of an uncorrelated event not farther than 68 years (for instance the last
connection to the National Library) divided by n.

for k = 1 to m

$$x_k = \frac{a\,n\,x_{k-1} \bmod n}{n} \tag{B.17}$$

Check: $n x_k$ is the remainder of the division of $a\,n\,x_{k-1}$ by n. Thus it ranges
between 0 and n, which makes $x_k$ range from 0 to 1. As the sequence is
completely determined by its starting point and by the two parameters n and
a, we:

• choose the initial value in an actually unpredictable way by wrapping
in the interval (0, n) a potentially high number connected with a co-occurrence
of a lot of other events (68 years ≈ $2^{31} - 1$ seconds) and divide
it by n;

• set a to a number relatively prime with n in order to inhibit circularities
in the history of the remainders;

• set n very high in order to avoid saturation of the remainder sequence.
Actually, even in the best case of having as many different values as possible,
after n trials the sequence must pass through a previous value and then
repeat itself.


While there is no algorithm promptly inverting the simple recursive function
(B.17) for any selection of the above parameters, we have no theoretical results
ensuring the unpredictability of the sequence for a good selection of them.
On the contrary, we know of some striking cases where even a coplanarity
condition exists among the data ⁵. However, although more robust generators are
available in the literature, Algorithm B.1 is widely used without inconvenience
for simulating random variables.
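A minimal Python sketch of Algorithm B.1 follows. It keeps the integer state $n x_k$ rather than the fraction itself, which avoids floating-point error in the recursion. The function name and the clock-based default seed are assumptions made for illustration, not part of the algorithm as stated.

```python
import time

N = 2**31 - 1   # modulus n of Algorithm B.1
A = 7**5        # multiplier a of Algorithm B.1

def congruential_uniform(m, seed=None):
    """Return m values in (0, 1) following the recursion (B.17).

    The integer state s_k = n * x_k is updated as s_k = a * s_{k-1} mod n.
    """
    if seed is None:
        # assumption: wall-clock seconds wrapped into 1..n-1 stand in for
        # the "uncorrelated event" used to seed the algorithm
        seed = int(time.time()) % (N - 1) + 1
    state = seed
    sample = []
    for _ in range(m):
        state = (A * state) % N
        sample.append(state / N)
    return sample

first_two = congruential_uniform(2, seed=1)
```

Passing an explicit seed makes the sequence reproducible, which is exactly why the text asks for the starting point to stay hidden from the end user.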


B.2.2 Computing c.d.f. from data 


A synthetic description of data coming from a probabilistic experiment (for instance
produced as before) is given without information loss by the cumulative
frequency of the values (called empirical c.d.f. elsewhere), i.e. by the function
$\hat F_X$ defined as follows:

$$\hat F_X(x) = \frac{1}{m} \sum_{i=1}^{m} I_{(-\infty, x]}(x_i) \tag{B.18}$$

where I denotes the indicator function introduced in Definition 1.21. Thus $\hat F_X$
computes the frequency of sample values not greater than x. Figure B.5
shows that this function is well approximated by its asymptotic expression (see
page 342)

$$F_X(x) = x\,I_{[0,1]}(x) + I_{(1,+\infty)}(x) \tag{B.19}$$

The approximation improves with the sample size passing from 10 to 100 and
1,000, which exhibits a good compliance of the procedure with the i.i.d. requirements.
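The empirical c.d.f. is straightforward to compute once the sample is sorted. The following sketch assumes the convention that values equal to x are counted; the function name and the four-point sample are illustrative only.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return a function x -> fraction of sample values <= x."""
    ordered = sorted(sample)
    m = len(ordered)

    def f_hat(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / m

    return f_hat

f_hat = empirical_cdf([0.1, 0.4, 0.4, 0.9])
```

Repeated values simply produce a taller step, as at 0.4 in this sample.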


⁵for instance they lie on a spiral in a hyperplane whose dimension is less than m [Ripley,
1987].




328 Random variables 


Fig. B.5: Convergence of a sample empirical c.d.f. computed for a continuous uniform
distribution to the c.d.f. of the same distribution. Panels: (a) m = 10, (b) m = 100,
(c) m = 1000. Black curve: c.d.f.; gray plot: empirical c.d.f.; m: sample size.


B.2.3 Generating discrete random variables 


Let’s consider a generic discrete random variable X taking values in
$\{x_1, x_2, \dots, x_\nu\}$ (with $\nu$ possibly infinite) with probability $p_i = P(X = x_i)$. We
get a sample of X specifications $\{x_1, \dots, x_m\}$ according to the inverse transform
method through Algorithm B.2.


Algorithm B.2 Generation of a discrete random variable with probability function
$\{p_i;\ i = 1, \dots, \nu\}$.
set m = sample size;
get a sample of specifications $\{u_1, \dots, u_m\}$ of the random variable U uniform
in [0,1] by running Algorithm B.1;
for k = 1 to m
    assign to the specification $x_k$ the value $x_i$ such that

$$\sum_{j=0}^{i-1} p_j \le u_k < \sum_{j=0}^{i} p_j \tag{B.20}$$

    with $p_0 = 0$

Check: This method outputs value $x_i$ with probability $p_i$. Indeed, denoting
with capital letters the random variables assuming specifications $u_k$ and $x_k$,

$$P(X_k = x_i) = P\left(\sum_{j=0}^{i-1} p_j \le U_k < \sum_{j=0}^{i} p_j\right) = \sum_{j=0}^{i} p_j - \sum_{j=0}^{i-1} p_j = p_i \tag{B.21}$$

Example B.4. Let $D_X = \{1,2,3,4\}$, with $p_1 = 0.2$, $p_2 = 0.15$, $p_3 = 0.25$,
$p_4 = 0.4$. The assignment statement in Algorithm B.2 reads:


Computing a sample of i.i.d. random variables 329 


Table B.1: A set of 100 values returned by algorithm B.2. 


{3, 1, 3, 3, 4, 3, 3, 3, 2, 4, 3, 1, 1, 2, 4, 4, 4, 4, 1, 3, 4, 
4,1, 2, 3, 3, 4, 4, 4, 1, 1, 2, 3, 1, 3, 1, 3, 4, 4, 2, 4, 1, 4, 
4, 4, 3, 4, 3, 2, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 2, 2, 4, 3, 4, 4, 
4, 4, 1, 4, 1, 1, 35. 4). 1, 1, 43.0252, 1,3, 3, 35.4, 1, Ly 4. 2; 
2,35 Lp 35 15 35 45.3535. 4, -45 2; 4} 

Fig. B.6: Description of the values in Table B.1 through their empirical c.d.f. (gray
plot) contrasted with the original distribution law c.d.f. (black plot).


if u_k < 0.2
    return 1;
else if u_k < 0.35
    return 2;
else if u_k < 0.6
    return 3;
else return 4;


Table B.1 shows a set of 100 values for the specifications of X obtained
through execution of the above algorithm. Figure B.6 shows the corresponding
empirical c.d.f. A more efficient statement is the following:

if u_k < 0.4
    return 4;
else if u_k < 0.65
    return 3;
else if u_k < 0.85
    return 1;
else return 2;

Though the two statements are equivalent from a (worst case) computational
complexity point of view, the latter will require on average a lower number of
comparisons due to the probability distribution of X.
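The gain of testing the most probable values first can be quantified. The sketch below builds the assign routine for an arbitrary testing order and computes the expected number of comparisons; the function names are hypothetical and the probabilities are those of Example B.4.

```python
from fractions import Fraction

def threshold_sampler(values, probs):
    """Assign routine for Algorithm B.2, testing values in the given order.

    Returns, for a uniform specification u, the selected value together
    with the number of comparisons spent to select it.
    """
    def assign(u):
        cumulative = 0
        pairs = zip(values[:-1], probs[:-1])
        for comparisons, (v, p) in enumerate(pairs, start=1):
            cumulative += p
            if u < cumulative:
                return v, comparisons
        # the last value needs no test of its own
        return values[-1], len(values) - 1
    return assign

def mean_comparisons(probs):
    """Expected number of comparisons when values are tested in this order."""
    n = len(probs)
    return sum(p * min(k, n - 1) for k, p in enumerate(probs, start=1))

p1, p2, p3, p4 = Fraction(1, 5), Fraction(3, 20), Fraction(1, 4), Fraction(2, 5)
natural_order = mean_comparisons([p1, p2, p3, p4])   # tests 1, 2, 3, then 4
sorted_order = mean_comparisons([p4, p3, p1, p2])    # tests 4, 3, 1, then 2
```

With these probabilities the natural order costs 2.45 comparisons on average, against 1.95 when values are tested by decreasing probability. Note that the two routines map the same u to different values, while both reproduce the required distribution.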




330 Random variables 


Fig. B.7: Description of the values returned by (B.23), for a = 1 and b = 6. Gray plot: 
empirical c.d.f. for a sequence of 1000 elements. Black curve: c.d.f. of the uniform 
distribution law with the above parameters. 


B.2.4 Generating continuous random variables 


Let’s consider a generic continuous random variable X with c.d.f. $F_X$. We get
a sample of X specifications $\{x_1, \dots, x_m\}$ according to the inverse transform
method through Algorithm B.3.

Algorithm B.3 Generation of a continuous random variable with c.d.f. $F_X$.
set m = sample size;
get a sample of specifications $\{u_1, \dots, u_m\}$ of the random variable U uniform
in [0,1] by running Algorithm B.1;
for k = 1 to m
    assign to the specification $x_k$ the value x such that

$$F_X(x) = u_k \tag{B.22}$$

    that we denote $F_X^{-1}(u_k)$;
end

Check: This method outputs values x with c.d.f. $F_X$. Indeed
$P(F_X^{-1}(U) \le x) = P(U \le F_X(x)) = F_X(x)$. See (1.67) with Z = Xx.
Example B.5. Let X be a variable with

$$F_X(x) = \frac{x - a}{b - a}\, I_{[a,b]}(x) + I_{(b,+\infty)}(x) \tag{B.23}$$

(uniform variable, see page 342). The assignment statement in Algorithm B.3
reads:

$$x_k = a + u_k (b - a) \tag{B.24}$$


Figure B.7 shows a comparison between the empirical c.d.f. of a set of 1,000 
values returned by this algorithm and the uniform distribution c.d.f., for a = 1 
and b = 6. 
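When the c.d.f. has no closed-form inverse, the equation $F_X(x) = u_k$ can still be solved numerically. The sketch below uses plain bisection, which is an assumption of this illustration rather than a method prescribed by the text, and compares it with the closed-form assignment of this example for a = 1 and b = 6.

```python
def invert_cdf(cdf, u, lo, hi, tol=1e-12):
    """Solve cdf(x) = u on [lo, hi] by bisection.

    Assumes cdf is continuous and increasing on the interval, with
    cdf(lo) <= u <= cdf(hi).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def uniform_cdf(x, a=1.0, b=6.0):
    """C.d.f. of the uniform variable on [a, b]."""
    return min(max((x - a) / (b - a), 0.0), 1.0)

def uniform_assign(u, a=1.0, b=6.0):
    """Closed-form inverse of the uniform c.d.f."""
    return a + u * (b - a)

numeric = [invert_cdf(uniform_cdf, u, 1.0, 6.0) for u in (0.1, 0.5, 0.9)]
closed = [uniform_assign(u) for u in (0.1, 0.5, 0.9)]
```

The closed form is obviously preferable when available; bisection only needs the ability to evaluate the c.d.f.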


Basic random variables 331 


B.3 Basic random variables 


The following sheets describe the most widely used random variables, listed
in a logical order according to the progressive model derivation considered in
Chapter 1. First we illustrate the discrete variables, then their continuous extensions.
The format is the following: we denote the random variable by X.
First we report the analytical form of the cumulative distribution function $F_X$
and the related either probability or density function $f_X$. We visualize these
functions through their graphs as well and synthesize their shape through the
expressions of the mean and variance of X. Then we consider the sampling
mechanism of X, which we use for generating a sample of 100 specifications of
X whose cumulative frequency $\hat F_X$ (plotted in gray) we graphically compare with
$F_X$ (plotted in black). Then we define the sufficient statistics at the basis of our
inference. With these statistics we: a) state twisting arguments, b) determine
the c.d.f. of the parameters, and c) determine unbiased estimators of them. The
maximum likelihood estimators of the parameters are reported as well, as the
best representatives of point estimators in the Kolmogorov approach. As remarked
elsewhere, for a questioned parameter $\Theta$, $s_\Theta$ is the observed statistic as a specification
of the random variable $S_\Theta$; $s_{\tilde\theta}$ is the value we would have observed with the
same seed of $s_\Theta$ but with specification $\tilde\theta$ of $\Theta$, as a specification of the random
variable $S_{\tilde\theta}$. By ceiling the parameter symbols with a hat or a breve, such as
with $\hat\theta$ or $\breve\theta$, we denote the unbiased estimator and MLE of a parameter $\Theta$. m
denotes the sample size, $\bar x$ the sample mean.

With Normal random variable we denote a Gaussian variable having null 
mean and unitary variance. 

In case of multiparametric distribution, we consider the statistics for the sin- 
gle parameters, assuming the others known. Joint statistics for sets of parame- 
ters and related estimators are considered in the dedicated Examples mentioned 
in the bottom page note. 

In many cases the assign statement in algorithms B.2 and B.3 may be 
substituted by smarter subroutines that either apply the same c.d.f. inversion 
strategy in an efficient way or change completely the generation strategy. 



332 Random variables 


B.3.1 Discrete random variables 
Discrete uniform distribution 


mean: $\frac{n+1}{2}$    variance: $\frac{n^2-1}{12}$


sampling mechanism 


Algorithm B.2 with assign routine: 


int uniformDiscrete(float u, int n) {
    /* maps u in [0,1) to an integer between 1 and n */
    return floor(n*u + 1);
}


shape of p.f. and c.d.f. (n = 10) 


twisting argument: 
(sa > sy) = (n < n) <= (sz > sy +1) 


unbiased estimator: 
+oo 


sy —1+(sy-1)™ ` 


i=SN 


maximum likelihood estimator: n = sy 


Discrete uniform distribution describes an experiment having n
equiprobable outcomes. See Sec. 1.3.6 for the meaning of ⌊x⌋.


Basic random variables 333 


Bernoulli distribution 


parameters: $p \in [0,1]$

p.f.: $f_X(x; p) = p^x (1 - p)^{1-x} I_{\{0,1\}}(x)$

c.d.f.: $F_X(x; p) = (1 - p)\, I_{[0,1)}(x) + I_{[1,+\infty)}(x)$


sampling mechanism 


Algorithm B.2 with assign routine: 


int bernoulli(float u, float p) {
    if (u < p)
        return 1;
    else return 0;
}


shape of p.f. and c.d.f. (p = 0.2) 


sufficient statistic: $S_P = \sum_{i=1}^{m} X_i$
twisting argument: 
(sp > sp) = (p < P) = (sp 2 sp +1) 


parameter distribution: 
m 


mM \ zi m—i 
D (EO =H" 03) + Fa.40)@) = Fo) 
i=sp 


m ~i m—i 
> > @: (1 — p)” Io, y (P) + Ia, +œ) (P) 
i=sp+1 
unbiased estimator: 


maximum likelihood estimator: $\breve p = s_P/m$


Bernoulli distribution describes an experiment which can either 
result in a “success” or in a “failure”. 


334 Random variables 


Binomial distribution 


parameters: $n \in \mathbb{N}$, $p \in [0,1]$


sampling mechanism 


Algorithm B.2 with assign routine: 


int binomial(float u[], int n, float p) {
    int i, x;
    x = 0;
    for (i = 0; i < n; i++)
        x += bernoulli(u[i], p);   /* one uniform value per Bernoulli trial */
    return x;
}


shape of p.f. and c.d.f. (p = 0.2, n = 10)

sufficient statistic: $S_P = \sum_{i=1}^{m} X_i$
twisting argument: 
(sp = sp) = (p < P) = (sp 2 sp +1) 
parameter distribution: 
mn\ —, EP 
BBY" To .(B) + a) > Fe) 
mn mn 
2 ( i )ra = P)” oa B) + Ta, +) (P) 
i=sp+1 
unbiased estimator: 
é $ 1 
e pe SPT o 
mn +1 mn +1 


maximum likelihood estimator: $\breve p = s_P/(mn)$


Binomial distribution describes the experiment of counting the
number of successes in a sequence of n independent Bernoulli experiments,
each with parameter p. We have no direct symbolic
expression of either the unbiased estimator or the MLE of n. Rather, a
consistent estimator is $\sum_{i=1}^{m} x_i/(mp)$ (see Examples 2.37
and 2.38).


Basic random variables 335 


Geometric distribution 


parameters: $p \in (0,1]$

p.f.: $f_X(x; p) = p (1 - p)^x I_{\mathbb{N} \cup \{0\}}(x)$

c.d.f.: $F_X(x; p) = \left(1 - (1 - p)^{\lfloor x \rfloor + 1}\right) I_{[0,+\infty)}(x)$

mean: $\frac{1-p}{p}$    variance: $\frac{1-p}{p^2}$
sampling mechanism 
Algorithm B.2 with assign routine: 


int geometric(float u, float p) {
    return floor(log(u)/log(1-p));
}


shape of p.f. and c.d.f. (p = 0.2)

sufficient statistic: $S_P = \sum_{i=1}^{m} X_i$
twisting argument: 
(sp < sp) = (p < P) = (sp < sp — 1) 


parameter distribution: 
SP 


2 ( É i Hara — B)'I(0,1(B) + Ta,+00)(P) = Fp(P) 
-1 


P” (1 — D) Io, (B) + Ia, +) (P) 


Geometric distribution describes an experiment counting the number
of failures before the first success in an infinite sequence of
independent Bernoulli experiments, each with parameter p.


336 Random variables 


Poisson distribution 
parameters: $\mu \in \mathbb{R}^+$

variance: $\mu$


sampling mechanism 


Algorithm B.2 with assign routine: 


int poisson(float u, float mu) {
    int i;
    float p, fp;
    i = 0;
    p = exp(-mu);              /* P(X = 0) */
    fp = p;                    /* running c.d.f. value F_X(i) */
    while (u >= fp) {
        p = mu * p / (i+1);    /* P(X = i+1) from P(X = i) */
        fp += p;
        i++;
    }
    return i;
}


shape of p.f. and c.d.f. ($\mu = 7$)

sufficient statistic: $S_M = \sum_{i=1}^{m} X_i$
twisting argument: 
(sz > sm) = (u < A) = (sã 2 sm +1) 


parameter distribution: 


ee 


maximum likelihood estimator: $\breve\mu = s_M/m$


Poisson distribution describes the experiment counting the number
of occurrences of an event so rare that: i) the probability of
two occurrences in the same time slot goes to 0 with the slot width,
and ii) an occurrence does not bias the rest of the sequence.


Basic random variables 337 


B.3.1.1 Examples 


Example B.6. Consider a random variable X describing the experiment of
throwing a fair die one time. Since the sample space is $D_X = \{1,2,3,4,5,6\}$
and each face of the die has the same probability of being thrown, X is a discrete
uniform random variable of parameter n = 6, having expected value

$$E[X] = \frac{1}{n} \sum_{i=1}^{n} i = \frac{n+1}{2} = \frac{7}{2} \tag{B.25}$$

and variance

$$V[X] = E[X^2] - E[X]^2 = \frac{n^2 - 1}{12} = \frac{35}{12} \tag{B.26}$$

since

$$E[X^2] = \frac{1}{n} \sum_{i=1}^{n} i^2 = \frac{(n+1)(2n+1)}{6} \tag{B.27}$$

Example B.7. The numbers {7,26,1,9,14,9,25,22,8,5} (generated by the
sampling mechanism on page 332 with n = 27) report a story of 10 throws of
a fair electronic die with an unknown number n of faces, each stamped with a
different number between 1 and n. Using the values in the record to represent
a possible sample of observations of this die we may estimate N through its
unbiased estimator (same page) within the interval

$$27.3 < E[N] < 28.4 \tag{B.28}$$

while the maximum likelihood estimator is $\breve n = x_{(m)} = 26$ and the weakly
unbiased estimator is $x_{(m)} \frac{m+1}{m} = 28.6$.


Example B.8. Suppose we are interested in the probability of throwing a 6 at
least k = 2 times on n = 5 throws of a six-faced fair die. If K is the number
of trials with outcome 6, it follows a binomial distribution of parameters n = 5
and p = 1/6. Then $P(K \ge 2) = 1 - \sum_{i=0}^{1} \binom{5}{i} (1/6)^i (5/6)^{5-i} \approx 0.196$. The binomial
distribution is studied extensively throughout the book: see Definition 1.10 for
the definition of the related probability space, Examples 1.7 and 1.8 for the
computation of its mean and variance, Fact 1.4 for the distribution of binomial
variables’ sum. Example 2.10 shows the membership of this distribution in the
exponential family, while Examples 2.7 and 2.38 deal with its sufficient statistics
and maximum likelihood estimators, respectively.
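The tail probability of this example can be computed exactly with rational arithmetic. The sketch below is a direct transcription of the complement rule; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_pf(i, n, p):
    """P(K = i) for a binomial variable of parameters n and p."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def at_least(k, n, p):
    """P(K >= k), computed as 1 - P(K <= k - 1)."""
    return 1 - sum(binomial_pf(i, n, p) for i in range(k))

prob = at_least(2, 5, Fraction(1, 6))
```

The exact value is 763/3888, i.e. a little less than one chance in five.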

If we are interested instead in how many trials we must perform before throwing
a 6, this random variable follows a geometric distribution with parameter
p = 1/6. The c.d.f. of a geometric distribution is analytically obtained by

$$F_X(x) = \sum_{j=0}^{x} p\,q^j = p \sum_{j=0}^{x} q^j = p\,\frac{1 - q^{x+1}}{1 - q} = 1 - q^{x+1} \tag{B.29}$$




338 Random variables 


where q = 1 − p. We avoid the length of this computation by just considering
that the event (X ≥ i) is equivalent to the event “all initial i events proved
faulty”. A smart assignment statement in Algorithm B.2 is:

$$x_k = \left\lfloor \frac{\ln u_k}{\ln q} \right\rfloor \tag{B.30}$$


If we do not know the true source of the data, and after having observed a string
x = {5,3,5,6,2,4,1,3,5,4,6,6} we wonder what is the probability p that the
next number will be 6, under the assumption that the observed $x_i$s are a sample of a
random variable X, the unbiased estimate of P is:


(B.31) 


Note that if we also consider the sequence with which the numbers appear, 
rather the distances between consecutive 6s, we get more information in this 
case, having, according to page 335 


3 1 
a a ee B.32 
B = 4 B82) 


which denotes an interval included in the former one. This is quite a general 
result, as when the sample ends with a success we have, picking formulas from 
pages 333 and 335 


k a yk ee k 
— = (%+1+— <(F¥+1) =—< 
m+1 m m 


(B.33) 


where m’ denotes the size of the sample accounting for the distances of consecu- 
tive 6s. Things are different if we sacrifice some observations by truncating the 
sample after the last success (i.e. 6 is not the last number in the sample but we 
do not care about the residual records after that). 


Example B.9. In an orienteering competition the race length is 10 km with 20
control points. On average, the runner misses at the first attempt one control
point out of every five. Thus to compute the probability that he finds control
point no. 15, we may consider an experiment consisting in a single trial with
two possible outcomes: missing or not missing, occurring with probabilities 1/5
and 4/5. We describe it through a Bernoulli random variable X with parameter
p = 4/5. This means that we associate value 1 of X to the event “the control
point has been found”, and 0 to the complementary event “the control point
has been missed”.

The probability of success increases if we are interested in finding at least
one control point within the first 15 at the first attempt. In this case we may
refer to a Binomial random variable Y of parameters n = 15 (the number of
trials we are allowed, i.e. the first trial for each control point, in order to find a
control point) and p = 4/5 (the success probability of the single trial). W.r.t. this
variable the event of our interest is Y ≥ 1, whose probability measure is

$$P(Y \ge 1) = \sum_{i=1}^{15} \binom{15}{i} \left(\frac{4}{5}\right)^i \left(\frac{1}{5}\right)^{15-i} \tag{B.34}$$

more easily computed as

$$P(Y \ge 1) = 1 - P(Y = 0) = 1 - \left(\frac{1}{5}\right)^{15} \tag{B.35}$$


To study local reaction to the checks the organizers decided to simulate a
series of runner’s stories. The generation of a binomial random variable through
the inverse transform method can be implemented in a smarter way than Algorithm B.2.
From the recurrence relation on parameter $p_i = P(X = i)$

$$p_{i+1} = \frac{n - i}{i + 1}\,\frac{p}{1 - p}\,p_i \tag{B.36}$$

the assignment statement in Algorithm B.2 may be implemented through the
routine described in Algorithm B.4.

Algorithm B.4 Generating a binomial variable
set $p_y = (1 - p)^n$; $f_y = p_y$;
$x_k = 0$;
while ($u_k \ge f_y$)
    $p_y = p_y\,p/(1 - p)\,(n - x_k)/(x_k + 1)$;
    $f_y = f_y + p_y$;
    $x_k = x_k + 1$;
return $x_k$;

On page 334 we report an even smarter
implementation of the generating algorithm. It follows from the remark that a
binomial random variable Y of parameters n and p can be obtained by the sum
of n i.i.d. Bernoulli random variables, each with parameter p (see Sec. 1.2.3).
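Both generation strategies for a binomial variable can be sketched in Python. The first routine walks the c.d.f. using the recurrence (B.36), keeping the current probability and the running cumulative sum as two separate quantities; the second sums Bernoulli specifications. Function names are hypothetical, and the first routine assumes 0 < p < 1.

```python
def binomial_inverse_transform(u, n, p):
    """Smallest x whose c.d.f. value exceeds u, for a binomial(n, p) variable."""
    prob = (1 - p) ** n        # P(X = 0)
    cumulative = prob          # F(0)
    x = 0
    while u >= cumulative and x < n:
        prob *= (n - x) / (x + 1) * p / (1 - p)   # recurrence (B.36)
        cumulative += prob
        x += 1
    return x

def binomial_by_sum(us, p):
    """Sum of Bernoulli specifications, one per uniform value in us."""
    return sum(1 for u in us if u < p)

draws = [binomial_inverse_transform(u, 5, 0.5) for u in (0.0, 0.1, 0.4, 0.5, 0.99)]
```

The first routine consumes a single uniform value per binomial specification, the second one consumes n of them; which is preferable depends on how expensive the uniform values are compared with the loop.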

After some time the runner, unable to find the right way to carry on, decides 
to stop some other competitors to ask for help. But on average only 1 runner 
out of 4 stops to help him. So he wonders how many runners will pass before 
he can get help. 

We may model the story of the runner as a sequence of independent Bernoulli
experiments with success parameter p = 1/4. Therefore, the random variable X
counting the number of persons who do not stop before receiving help is a
geometric variable of parameter p = 1/4.

It enjoys an invariance property due to the fact that the number of persons 
who do not stop is independent of the number of persons who have already 
spoken with the runner. We may read it as an independence of future from past 
in the sense that the amount of time the runner must wait does not change if 
he has already seen 1, 2 or 100 persons who did not stop. 
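This invariance can be verified exactly from the fact that a geometric variable counting failures satisfies $P(X \ge k) = (1-p)^k$. The sketch below checks it for p = 1/4; the function names and the tested values of s and t are illustrative.

```python
from fractions import Fraction

def survival(k, p):
    """P(X >= k) for a geometric X counting failures before the first success."""
    return (1 - p) ** k

def conditional_survival(t, s, p):
    """P(X >= s + t | X >= s): at least t more failures after s failures."""
    return survival(s + t, p) / survival(s, p)

p = Fraction(1, 4)
unconditional = survival(3, p)
after_waiting = [conditional_survival(3, s, p) for s in (1, 2, 100)]
```

Whatever the number s of runners already seen, the probability of at least three more passing without stopping stays at 27/64.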


340 Random variables 


After control point no. 15 the runner feels lost again. Then he stops in order
to recognize his surroundings and realizes that an average of 2 other runners pass
him every minute. Planning for the future, he wonders about the probability of
meeting at least one other runner in the next minute. We may model the number
of people passing by the runner in the next minute with a Poisson variable K, as
in principle we have no upper bound to this number, where parameter $\mu$ is the
average of this variable that we recognized to be 2. Therefore the probability of
his interest is

$$P(K \ge 1) = 1 - F_K(0) = 1 - e^{-2} = F_T(1) \tag{B.37}$$

where T, distributed according to a negative exponential law of parameter $\lambda = 2$
(see Sec. 1.3.4 and page 343), denotes the first passage time. Indeed

$$P(\text{no runner in time interval } (0,t)) = P(K = 0) = P(T > t) = e^{-\lambda t} \tag{B.38}$$

hence the c.d.f. of T will be

$$F_T(t) = 1 - P(T > t) = 1 - e^{-\lambda t} \tag{B.39}$$


Example B.10. The owner of a video shop knows that, on average, 2% of the
packages managed by video-dispensers hold a movie different from the one
shown on the label. As of now, he knows that there are 200 videos in his shop.
To get a quick idea of possible problems met by users he decides to estimate the
probability that at most 5 packages contain a wrong movie. The idea is to check
the items one by one if this probability is under 70%. The random variable X
counting the number of questioned packages follows a binomial distribution of
parameters n = 200 and p = 0.02. Thus:

$$P(X \le 5) = \sum_{x=0}^{5} \binom{200}{x} 0.02^x\, 0.98^{200-x} \tag{B.40}$$

Since this calculation takes too long, he decides to approximate the binomial distribution
with a Poisson distribution of parameter $\mu = np = 200 \cdot 0.02 = 4$,
obtaining:

$$P(X \le 5) = \sum_{x=0}^{5} \frac{4^x e^{-4}}{x!} = 0.785 \tag{B.41}$$

Actually, the output of (B.40) is 0.787, thus denoting the suitability of the
Poisson approximation (see Sec. 1.3.4). Vice versa, after an overall check that he
carries out in spite of the previous strategy, he finds a percentage of 6%, i.e. 12
mislabeled packages out of 200. This suggests that customers of his shop are
particularly impolite, since the 0.990 quantile of a Binomial distribution as above
(or its Poisson approximation) is 9.
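The figures of this example are easy to reproduce numerically. The following sketch computes both cumulative probabilities and the 0.990 quantiles by direct summation; the function names are illustrative and only the standard library is used.

```python
from math import comb, exp, factorial

def binomial_cdf(k, n, p):
    """P(X <= k) for a binomial variable of parameters n and p."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson variable of parameter mu."""
    return sum(mu**x * exp(-mu) / factorial(x) for x in range(k + 1))

def quantile(cdf, q):
    """Smallest non-negative integer k with cdf(k) >= q."""
    k = 0
    while cdf(k) < q:
        k += 1
    return k

exact = binomial_cdf(5, 200, 0.02)
approx = poisson_cdf(5, 4.0)
q99_binomial = quantile(lambda k: binomial_cdf(k, 200, 0.02), 0.99)
q99_poisson = quantile(lambda k: poisson_cdf(k, 4.0), 0.99)
```

Under either law, observing 12 mislabeled packages lies well beyond the 0.990 quantile.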


Basic random variables 341 


Example B.11. The ticket office of a ferry company sold {84, 101, 103, 98, 91}
car tickets in the last five working days. Assuming these numbers to be a sample
of a Poisson random variable X with mean $\mu_X$, the unbiased point estimator
provided on page 336 gives an expected number of embarked cars:

$$95.4 < \hat\mu_X < 95.6 \tag{B.42}$$

For the next week 45 extra places have been booked each day. Thus the director
of the company must increment the number of trips to fulfill the service. Namely
his boat contains 60 cars. Thus he plans to make three trips at one third (say
ten o’clock a.m.), two thirds (two p.m.) and end (six p.m.) of the 12-hour
workdays. He wonders what is the probability R that no cars miss the next trip
to its arrival time.

Assuming the arrival time to be homogeneous along the workday, the parameter
of the Poisson variable Y describing the number of cars in each slot is
$\mu/3$. The questioned R concerns the event that on each slot the number of car
arrivals is less than 60. This number is made up of a constant part equal to 15,
as one third of extra places booked per day.

$$R = F_Y(45; \mu/3)^3 \tag{B.43}$$

Concerning $\mu$ we may estimate it through:
• the unbiased estimator $\hat\mu \approx 95.5$, having R = 0.968

• the quantile $\mu_{0.900}$ of the distribution law of the Poisson parameter as on
page 336, having R > 0.924 with a confidence 0.900

• the mean value of $R_{|\mu}$, i.e. $\int_0^{+\infty} R_{|\mu} f_M(\mu)\,d\mu$, having R = 0.962 as a true
unbiased estimate of R.




342 Random variables 


B.3.2 Continuous random variables 


Continuous uniform distribution 
parameters: $a, b \in \mathbb{R}$, with a < b

c.d.f.: $F_X(x; a, b) = \frac{x - a}{b - a}\, I_{[a,b]}(x) + I_{(b,+\infty)}(x)$


sampling mechanism 


Algorithm B.3 with assign routine: 
float uniformContinuous(float u, float a, float b) {
    /* map the seed u, uniform in [0,1], onto [a,b] */
    return a + (b-a)*u;
}


shape of d.f. and c.d.f. (a = 2, b = 7) 
F, F 


twisting arguments: 
(a <) S (Sa > sa), b <b) e (s 


parameters distribution: 


Fa(@) = (1- = 


maximum likelihood estimators: $\hat{a} = s_A$, $\hat{b} = s_B$


Continuous uniform distribution describes an experiment whose 
outcomes take values in a continuous interval [a,b], with constant 
probability density. For joint parameters distribution see Example 
2.12 and for point estimators Example 2.30. 




Negative exponential distribution 


parameters: $\lambda \in \mathbb{R}^+$


c.d.f.: $F_X(x; \lambda) = (1 - e^{-\lambda x})\, I_{[0,+\infty)}(x)$


mean: $\frac{1}{\lambda}$    variance: $\frac{1}{\lambda^2}$


sampling mechanism 


Algorithm B.3 with assign routine: 


float exponential(float u, float lambda) {
    /* inverse transform of the seed u, uniform in (0,1]; needs <math.h> */
    return -log(u)/lambda;
}


shape of d.f. and c.d.f. (A = 0.25) 
f 


0.25 


sufficient statistic: $S_\Lambda = \sum_{i=1}^{m} X_i$


Negative exponential distribution describes an experiment mea- 
suring the distance (in any metric) between two successive occur- 
rences in a Poisson experiment. 




Gaussian distribution 


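parameters: $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$


d.f.: $f_X(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$


c.d.f.: $F_X(x; \mu, \sigma) = \int_{-\infty}^{x} f_X(\xi; \mu, \sigma)\,d\xi$


mean: $\mu$    variance: $\sigma^2$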


sampling mechanism 


Algorithm B.3 with assign routine: 


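float normal(float u1, float u2, float mu, float sigma) {
    const float PI = 3.14159;
    float theta, rho;
    theta = uniformContinuous(u1, 0, 2*PI);  /* angle, uniform in [0, 2*PI) */
    rho = sqrt(exponential(u2, 0.5));        /* radius: square root of an exponential with lambda = 1/2 */
    return mu + sigma*rho*cos(theta);
}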


m m 
sufficient statistics: Sm = yx Sse = SOX =u) 
i=1 i=1 
twisting arguments: 
(u< T) S (sa > sm), (0° < 


parameters distribution: 


maximum likelihood estimators: /i = m 6 

Gaussian distribution describes an experiment consisting in the 
sum of the results of a large number of uncorrelated elementary 
experiments. Despite its unappealing expression, the unbiased 
estimator of X is easily computable in numerical way. Z denotes 
a Normal variable, x2, a Chi-square distribution as in page 347. 


For joint parameters distribution see Example 2.25. 




Gamma distribution 
parameters: $n \in \mathbb{R}^+$, $\lambda \in \mathbb{R}^+$


c.d.f.: $F_X(x; n, \lambda) = \int_0^x f_X(\xi; n, \lambda)\,d\xi\; I_{[0,+\infty)}(x)$


sampling mechanism 


Algorithm B.3 with assign routine: 


float gamma(float u[], int n, float lambda) {
    /* u[0..n-1]: n independent uniform seeds; n is assumed to be an integer */
    int i;
    float x = 0;
    for (i = 0; i < n; i++)
        x += exponential(u[i], lambda);
    return x;
}


shape of d.f. and c.d.f. (n = 2, A = 0.5) 
F, F 


m 


sufficient statistics: SA = SOX Sn 
i=l 
twisting arguments: 
A<À S (s5 < sa) (n< 

parameters distribution: 

_ mn—1 (Asa) i esa 
oeo a 
i=0 


I(0,400) (A) 


i! 


unbiased estimators: 
~ nm 
à = — 

SA 


Gamma distribution describes an experiment of summing n independent
negative exponential experiments, each with parameter
$\lambda$. $\Gamma(n)$ denotes the Gamma function $\Gamma(n) = \int_0^{+\infty} e^{-u} u^{n-1}\,du$.
Fs, is the Meijer G-function [Erdélyi et al., 1981]. See Sec. 3.1.4 


for the meaning of [x]. 




Beta distribution 
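parameters: $a, b \in \mathbb{R}^+$


d.f.: $f_X(x; a, b) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}\, I_{[0,1]}(x)$


c.d.f.: $F_X(x; a, b) = \int_0^x f_X(\xi; a, b)\,d\xi\; I_{(0,1)}(x) + I_{[1,+\infty)}(x)$


mean: $\frac{a}{a+b}$    variance: $\frac{ab}{(a+b)^2(a+b+1)}$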


sampling mechanism 


Algorithm B.3 with assign routine: 


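float beta(float u1, float u2, float a, float b) {
    float v1, v2, w;
    do {
        v1 = pow(u1, 1/a);
        v2 = pow(u2, 1/b);
        w = v1 + v2;
        /* fresh seeds for a possible retry, drawn in (0,1) with rand() from
           <stdlib.h>: an assumption, since the listing does not show how
           rejected seeds are replaced */
        u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    } while (w > 1);
    return v1/w;
}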


shape of d.f. and c.d.f. (a = 8, b = 1.5) 


B(a,b) denotes the Beta function $B(a,b) = \int_0^1 u^{a-1}(1-u)^{b-1}\,du$.
The sampling mechanism is as in [Ripley, 1987].




Chi square distribution 


shape of d.f. and c.d.f. (m = 10) 
f 


0.1 


Student’s t distribution 


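d.f.: $f_X(x; m) = \frac{\Gamma((m+1)/2)}{\Gamma(m/2)}\, \frac{1}{\sqrt{\pi m}}\, \frac{1}{(1 + x^2/m)^{(m+1)/2}}$


c.d.f.: $F_X(x; m) = \int_{-\infty}^{x} f_X(\xi; m)\,d\xi$


mean: 0, for m > 1    variance: $\frac{m}{m-2}$, for m > 2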


shape of d.f. and c.d.f. (m = 10) 
F 


A difficult 
forecasting 


Managing 
reliability 




B.3.2.1 Examples 


Example B.12. Let's consider a swimmer in a swimming pool 50 meters long.
He swims at a constant speed, i.e. he covers equal distances in equal amounts
of time. Therefore, for an unknown starting time, each point of the pool has the
same probability of being crossed by him at any given time. This statement, trivial per
se since the probability of any single value of a continuous variable is always 0, makes
sense if we replace points with equally sized elementary neighborhoods of them.
We model the current position with a continuous random variable X uniform
in [a,b], with a = 0 corresponding to one side of the pool and b = 50 meters
to the opposite one. The density function is $f_X(x) = \frac{1}{50} I_{[a,b]}(x)$. Wanting to
guess the swimmer's position x at a certain time, we cannot expect to find him
at any fixed point, for the same reason concerning the continuity of X. However, the value
$x_0$ minimizing a quadratic cost function $(x - x_0)^2$ on average, i.e. solving the
problem:

$$\min_{x} E[(X - x)^2] \qquad \text{(B.44)}$$

is the expected value

$$E[X] = \int_0^{50} \frac{x}{50}\,dx = \left[\frac{x^2}{100}\right]_0^{50} = 25 \qquad \text{(B.45)}$$

expressed in meters, where the minimum coincides with the variance of X, i.e.
$V[X] = \frac{(b-a)^2}{12} = 208.33$, expressed in squared meters (see page 342). Hence, a
neighborhood of $x_0$ is preferable to others.

Wanting to estimate b from a sample of the swimmer's observed positions we
can use the techniques explained in Example 2.11, in Sec. 2.2.1, in Example
B.7 or in Example 2.29.


Example B.13. The life X of a toaster, expressed in months, follows an exponential
law with parameter $\lambda = 0.03$ month$^{-1}$. The manufacturer has to decide the
terms of the warranty so that the probability that the toaster passes the
warranty period without failure equals 0.85. This is a typical example of an inverse
problem: rather than computing the distribution function at a point x, here the
probability is given and the point x is requested. In this case


$$0.85 = P(X > x) = 1 - P(X \le x) = e^{-\lambda x} \;\Rightarrow\; x = 5.42 \qquad \text{(B.46)}$$


Therefore the manufacturer decides on a 6-month warranty. P(X > t) is referred
to as the reliability R(t) of the apparatus at time t. For the negative exponential
distribution of parameter $\lambda$

$$R(t) = 1 - F_X(t) = e^{-\lambda t} \qquad \text{(B.47)}$$

where $E[X] = \frac{1}{\lambda}$ reads as the average time between two breakdowns, assuming
that the apparatus can be perfectly repaired after each breakdown (what we call an
apparatus without memory$^6$). A retailer assumes the manufacturer's warranty
is too conservative. Therefore he records the lifetime of 5 toasters he has sold


6 As a matter of fact, real apparatuses drift from the simple exponential model in that $\lambda$




Fig. B.8: Typical shape of a variable breakdown occurrence rate A(t) in real appara- 
tuses. 


obtaining the sample: {171, 4.54, 17.4, 48.0, 4.14}. On the basis of this sample
he gets an unbiased estimate $\hat{\lambda} = 0.0204$ according to page 343, and to get the
same probability he fixes the warranty period at 8 months.
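
The inverse problem of this example can be sketched in a few lines of code. The function names are ours; the rate estimate is written as sample size over sample sum only because this reproduces the quoted value 0.0204, the estimator of page 343 not being restated here.

```c
#include <math.h>

/* Reliability of a negative exponential lifetime, R(t) = exp(-lambda t), as in (B.47). */
double reliability(double t, double lambda) {
    return exp(-lambda * t);
}

/* Inverse problem of (B.46): the time x such that P(X > x) = p. */
double warranty_time(double p, double lambda) {
    return -log(p) / lambda;
}

/* Rate estimate as sample size over sample sum (an assumption: it is the
   form that matches the value 0.0204 quoted in the text). */
double rate_estimate(const double *x, int m) {
    double s = 0.0;
    for (int i = 0; i < m; i++)
        s += x[i];
    return m / s;
}
```

With the manufacturer's rate, `warranty_time(0.85, 0.03)` gives about 5.42 months; with the retailer's estimate it gives slightly less than 8 months.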


Example B.14. A manufacturing process produces hubs, which are assumed 
to be satisfactory if their external diameter is less than or equal to 15.00 mm, 
defective otherwise. The diameter of these hubs follows a Gaussian distribution 
law. We know that with the present installations, on average, 20% of the items 
produced have a diameter less than 14.80 mm and 10% a diameter greater than 
14.90 mm. 

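To find the two parameters, namely the mean and standard deviation of the
distribution law, we write a system of two equations and solve it for $\mu$ and $\sigma$:


$$\begin{cases} P(X > 14.90) = 0.10 \\ P(X < 14.80) = 0.20 \end{cases} \;\Rightarrow\; \begin{cases} 1 - F_{N_{0,1}}\left(\frac{14.90-\mu}{\sigma}\right) = 0.10 \\ F_{N_{0,1}}\left(\frac{14.80-\mu}{\sigma}\right) = 0.20 \end{cases}$$

$$\Rightarrow\; \begin{cases} \frac{14.90-\mu}{\sigma} = 1.28 \\ \frac{14.80-\mu}{\sigma} = -0.841 \end{cases} \;\Rightarrow\; \begin{cases} \mu = 14.84 \\ \sigma = 0.0471 \end{cases}$$


In order to decide whether or not to proceed with a general overhaul, the
production engineer computes the expected percentage of defects, i.e. the probability
of X being greater than 15:


$$P(X > 15) = 1 - P(X \le 15) = 1 - F_{N_{0,1}}\left(\frac{15 - 14.84}{0.0471}\right) = 0.0003341$$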


To analyze the effects of shifts from the diameter nominal value, we can directly 
simulate the random variable c(X), being c the drawback cost deriving from 


proves a function of time like in Fig. B.8, where the two moustaches denote the running in, 
the left one, and the aging, the right one, times of the apparatus. 


and hazard rates. 


Controlling a 
batch production 


and simulating it. 


Identifying 
Gaussian variable 
parameters 




any specific value of the hub diameter. Actually, a quick implementation of
the inverse transform method would require an analytical form of the c.d.f. of X.
This form does not exist for the Gaussian distribution law. We may either use
approximate techniques (as most packages do) or change the generating strategy,
for instance adopting the following Box-Muller method [Ross, 1997]. It derives
from the consideration that the joint density function of two independent normal
variables X and Y (see (B.123)):


$$f_{X,Y}(x,y) = f_X(x)\, f_Y(y) = \frac{1}{2\pi}\, e^{-\frac{x^2+y^2}{2}} \qquad \text{(B.48)}$$

can be expressed in polar coordinates as

$$f_{R,\Theta}(\rho,\theta) = \frac{1}{2\pi}\, \rho\, e^{-\frac{\rho^2}{2}} \qquad \text{(B.49)}$$

which, for $D = R^2$, may be factorized in turn as

$$f_{D,\Theta}(d,\theta) = \frac{1}{2\pi}\, \frac{1}{2}\, e^{-\frac{d}{2}} = f_\Theta(\theta)\, f_D(d) \qquad \text{(B.50)}$$


Equation (B.50) implies that $\Theta$ follows a uniform distribution in $[0, 2\pi)$ and D
follows a negative exponential one with parameter $\lambda = 1/2$. Thus it is possible
to generate the pair $(\Theta, D)$ by independently generating $\Theta$ as in Example B.5
and D as in Example 1.10, each from its own random seed. Then
we transform them into cartesian coordinates to obtain (X,Y):


$$x = \sqrt{-2\ln(u_1)}\,\cos(2\pi u_2) \qquad \text{(B.51)}$$
$$y = \sqrt{-2\ln(u_1)}\,\sin(2\pi u_2) \qquad \text{(B.52)}$$


where the seed $u_1$ generates D and the seed $u_2$ generates $\Theta$. Each pair of
components can then be used as a pair of realizations of a standard
normal random variable $N_{0,1}$. To generate a generic Gaussian variable X with
mean $\mu \in \mathbb{R}$ and standard deviation $\sigma \in \mathbb{R}^+$ we use the linear transform $X =
\mu + \sigma Z$, where Z is a Normal variable (Gaussian with $\mu = 0$ and $\sigma = 1$). Indeed
the change of scale does not modify the shape of the distribution law. Moreover,
from (B.8) we have that $E[X] = \mu$ (since E[Z] = 0), and from (B.15) we have
$V[X] = \sigma^2$ (since V[Z] = 1). Actually, the estimates of mean and variance of
the hubs population are commonly computed as in the next example.


o 
Example B.15. A pharmaceutical company declares that its antipyretic
induces a mean temperature decrease $\mu_X = -2$ degrees centigrade with
variance $\sigma_X^2 = 0.5$. A hospital measures the temperature decreases d =


{-1.15, -1.08, -2.07, -2.27, -0.74, -1.26, -0.41, -1.19, -0.60, -2.16, -2.09,
-2.52, -1.47, -1.48, -1.55, 0.47, -2.11, -0.46, -1.64, -1.10} on 20 patients
after they have taken the medicine. Hence, we have weak unbiased estimates
$\hat{\mu} = -1.344$ and $\hat{\sigma}^2 = 0.5685$ of the mean and variance respectively of the
random variable D of which the above values are a sample. Given the discrepancy




between $\hat{\mu}$ and the $\mu$ declared by the company, we may check whether the sample
size is large enough to ensure a shift between M and $\mu$ less than 0.1 in absolute
value with probability 0.900. We accept the variance declared by the company
and assume D Gaussian. Then, for Z denoting a Normal variable, from


$$F_Z\left(\frac{0.1}{\sigma/\sqrt{m}}\right) = 0.950 \qquad \text{(B.53)}$$

we discover that we would need a sample size equal to 136 to fulfill our
requirement. If we do not accept the declared variance we may estimate it instead.
According to the formulas on page 344 we have an unbiased estimate $\tilde{\sigma}^2 = 0.635$.
This value is not so far from the one declared by the company, since a 0.900
confidence interval for $\Sigma^2$ is (0.358, 1.07). In any case we may repeat the above
computation substituting $\sigma^2$ with $\tilde{\sigma}^2$ and the Gaussian distribution with the Student t
as in Example 2.27.

If we do not rely on the Gaussian shape of the distribution law we may use
the Chebyshev inequality (2.174) to state:


$$P(|M - \mu| < 0.1) \ge 1 - \frac{\sigma^2}{m \cdot 0.1^2} = 0.900 \qquad \text{(B.54)}$$


Giving $\sigma^2$ the value 0.5 supplied by the company, we get m = 500, and if we
take 0.635 instead, then m = 635.


Example B.16. A truck is equipped with 8 tires, 4 of which are in use while the
remaining ones are kept in case of blow-outs. The truck driver knows that, on the
roads he usually travels, a tire blow-out occurs on average every 100,000 km. Having
to drive for 20,000 km, the driver is concerned with the probability of getting
to his destination without using up all his spare tires. As derived at the end of
Example B.9, each of the random variables $R_i$ (i = 1,...,4) describing the time
of puncture of the i-th wheel is distributed according to a negative exponential
law of parameter $\lambda = 1$ if distances are measured in fractions of 100,000 km, so
that for instance the probability that the 3-rd tire is still functioning after 250 km is


$$P(R_3 > 250/100{,}000) = e^{-\frac{250}{100{,}000}} = 0.9975 \qquad \text{(B.55)}$$


which decreases to 0.8187 if the distance covered is 20,000 km. Now, the
random variable describing the time at which the first blow-out occurs is R =
$\min\{R_1, R_2, R_3, R_4\}$, whose c.d.f. is


$$F_R(t) = P(\min\{R_1,\dots,R_4\} \le t) = 1 - P(\min\{R_1,\dots,R_4\} > t)$$
$$= 1 - P(R_i > t\ \ \forall i = 1,\dots,4) = 1 - \prod_{i=1}^{4} P(R_i > t) = 1 - (e^{-\lambda t})^4 = 1 - e^{-4\lambda t} \qquad \text{(B.56)}$$


and checking 
their reliability. 


Surviving a long 
trip 




that is, R follows a negative exponential distribution of parameter $\lambda' = 4\lambda = 4$,
so that the analogue of probability (B.55) referred to 20,000 km amounts to
0.4493.

Moreover, any negative exponential random variable X of parameter $\lambda$
satisfies the property

$$P(X > s+t \mid X > t) = \frac{P(X > s+t)}{P(X > t)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s) \qquad \text{(B.57)}$$

which is often referred to as lack of memory for the related probability
distribution. Therefore the random variable describing the distance covered before
the fifth blow-out occurs (and hence the truck needs to stop) is $G = \sum_{i=1}^{5} R_i$,
where the $R_i$ now denote the successive distances between blow-outs, each distributed as R,
following a Gamma distribution of parameters n = 5 and $\lambda = \lambda'$. The c.d.f. of
this distribution can be easily computed when $n \in \mathbb{N}$. Indeed it assumes the
form


$$F_G(x) = 1 - \int_x^{+\infty} \frac{\lambda^n}{(n-1)!}\, u^{n-1} e^{-\lambda u}\,du = 1 - I_n \qquad \text{(B.58)}$$


where a simple integration by parts allows us to state the following recurrence
relation for $I_n$

$$I_n = \frac{(\lambda x)^{n-1} e^{-\lambda x}}{(n-1)!} + I_{n-1} \qquad \text{(B.59)}$$


which, given the fact that $I_1 = e^{-\lambda x}$, leads to the following form
for the Gamma c.d.f.


$$F_G(x) = 1 - \sum_{i=0}^{n-1} \frac{(\lambda x)^i e^{-\lambda x}}{i!} \qquad \text{(B.60)}$$

i.e., from the logical perspective, the probability that the fifth blow-out time
is less than x coincides with the probability that the number of blow-outs in
the interval (0, x) is greater than or equal to 5, since the two events coincide as well.
Thus the answer to the driver's question is


$$P\left(G > \frac{20{,}000}{100{,}000}\right) = \sum_{i=0}^{4} \frac{(4 \cdot 0.2)^i e^{-4 \cdot 0.2}}{i!} \approx 0.9986 \qquad \text{(B.61)}$$


B.4 Joint random variables 


B.4.1 The discrete case 


Putting many
variables together

In this section we make the default assumption that we are dealing exclusively
with discrete random variables. A replica of all definitions and results for
continuous random variables will be given in the next section. The subject of this
section is an n-dimensional random variable X, i.e. a vector $(X_1,\dots,X_n)$ of
random variables we consider jointly in order to capture features connected with
their integrated behavior.






Fig. B.9: Computing the probability that the random vector (X,Y) falls in a bidi- 
mensional interval J marked by a gray rectangle, in case J is unbounded in: (a) two 
edges; (b) one edge only; and (c) no edge. 


Definition B.10. The joint cumulative distribution function of an n-dimensional
random variable X is defined by:


$$F_X(x) = F_{X_1,\dots,X_n}(x_1,\dots,x_n) = P(X_1 \le x_1, \dots, X_n \le x_n) \qquad \text{(B.62)}$$


It is easy to see that:


(a) $F_{X,Y}(x_1, y_1)$ is the probability that the random vector (X,Y) is contained
in the (infinite) interval marked in gray in Fig. B.9(a);


(b) $P(x_1 < X \le x_2, Y \le y_2) = F_{X,Y}(x_2, y_2) - F_{X,Y}(x_1, y_2)$ (see Fig. B.9(b));


(c) $P(x_1 < X \le x_2, y_1 < Y \le y_2) = F_{X,Y}(x_2,y_2) - F_{X,Y}(x_1,y_2) -
F_{X,Y}(x_2,y_1) + F_{X,Y}(x_1,y_1)$ (see Fig. B.9(c)).


Definition B.11. The joint probability function of an n-dimensional random
variable X is defined by:


$$f_X(x) = f_{X_1,\dots,X_n}(x_1,\dots,x_n) = P(X_1 = x_1, \dots, X_n = x_n) \qquad \text{(B.63)}$$


Equation (B.5) extends to the multidimensional case for connecting $F_X$ to $f_X$
as follows:


Distribution 
function 


Probability 
function 




We may always imagine the variable of our interest to be a component, say $X_j$,
of a vector X of random variables ruled by a joint distribution law. In this
respect $F_{X_j}$ and $f_{X_j}$ are marginal functions obtained as follows.


Definition B.12. The distribution function $F_{X_j}$, as a marginal distribution function
of an n-dimensional random variable $(X_1,\dots,X_n)$ with respect to $X_j$, is
computed by:

$$F_{X_j}(x_j) = F_{X_1,\dots,X_j,\dots,X_n}(+\infty, \dots, x_j, \dots, +\infty) \qquad \text{(B.65)}$$


which algorithmically reads: 


Fx, (xj) = 5 sae oa aoe 5 Sxi Xj yer Xn (Zins eee s Tijo eee Xin) (B.66) 
i=1 ij= inl 


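Definition B.13. The probability function $f_{X_j}$, as a marginal probability
function of an n-dimensional random variable $(X_1,\dots,X_n)$ with respect to $X_j$, is
computed by:


$$f_{X_j}(x_j) = \sum_{i_1=1}^{\nu_1} \cdots \sum_{i_{j-1}=1}^{\nu_{j-1}} \sum_{i_{j+1}=1}^{\nu_{j+1}} \cdots \sum_{i_n=1}^{\nu_n} f_{X_1,\dots,X_n}(x_{i_1}, \dots, x_{i_{j-1}}, x_j, x_{i_{j+1}}, \dots, x_{i_n}) \qquad \text{(B.67)}$$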
E 


Knowing the joint distribution function always makes it possible to compute the
marginal distributions, but the converse is not true. To cope with the inverse
task we will refer to the conditional distribution, which gives the probability
of a component when the realizations of the others are fixed. For the sake of
simplicity we write all the definitions for the simplest case of two joint random
variables $X_1$, $X_2$, the extension to n random variables being obvious.


Definition B.14. Given the vector of the two joint random variables $X_1$ and
$X_2$, we consider the following conditional distributions of $X_1$ with respect to $X_2$
with the following notation:


• F and f denote cumulative distribution and probability functions,
respectively;


• subscripts to the above functions are composed of two parts separated by
"|": the left part denotes the random variable described by the functions
and thus has its specification in argument; the right part identifies the
conditioning event.




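As a direct application of conditional event probabilities (see Sec. 1.3.3), the
functions are computed as follows$^7$:


$$F_{X_1|X_2 \le x_2}(x_1) = \frac{F_{X_1,X_2}(x_1,x_2)}{F_{X_2}(x_2)} \qquad \text{(B.68)}$$
$$f_{X_1|X_2 \le x_2}(x_1) = \frac{\sum_{x \le x_2} f_{X_1,X_2}(x_1,x)}{F_{X_2}(x_2)} \qquad \text{(B.69)}$$
$$F_{X_1|X_2 = x_2}(x_1) = \frac{\sum_{x \le x_1} f_{X_1,X_2}(x,x_2)}{f_{X_2}(x_2)} \qquad \text{(B.70)}$$
$$f_{X_1|X_2 = x_2}(x_1) = \frac{f_{X_1,X_2}(x_1,x_2)}{f_{X_2}(x_2)} \qquad \text{(B.71)}$$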
o 


The independence between events described by random variables translates into
the two following equivalent definitions of independence between random
variables:


Definition B.15. The random variables $X_1,\dots,X_n$ are independent only if


$$F_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^{n} F_{X_i}(x_i), \qquad \text{(B.72)}$$

or, equivalently,

$$f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^{n} f_{X_i}(x_i). \qquad \text{(B.73)}$$


The inverse implication requires the factorization of the above functions for each
possible subgroup of variables.


The independence between random variables implies that no distinction exists
between their marginal and conditional distributions with respect to one
another. By simple algebra we have from (B.71) that


$$f_{X_1,X_2}(x_1,x_2) = f_{X_1|X_2=x_2}(x_1)\, f_{X_2}(x_2) \qquad \text{(B.74)}$$


In case $X_1$ and $X_2$ are independent, i.e.


$$f_{X_1,X_2}(x_1,x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2) \qquad \text{(B.75)}$$

(B.74) allows us to say that:

$$f_{X_1|X_2=x_2}(x_1) = f_{X_1}(x_1) \qquad \text{(B.76)}$$


When no ambiguity arises, $f_{X_1|X_2=x_2}$ is denoted $f_{X_1|x_2}$ for short.


Conditioning in 
different ways 


Independence 


A necessary 
condition, 


a heavier 
sufficient 
condition 


Chain rules 


In any case 


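$$F_{X_1,\dots,X_n}(x_1,\dots,x_n) = F_{X_1}(x_1)\, F_{X_2|X_1 \le x_1}(x_2) \cdots F_{X_n|X_1 \le x_1,\dots,X_{n-1} \le x_{n-1}}(x_n) \qquad \text{(B.77)}$$


$$f_{X_1,\dots,X_n}(x_1,\dots,x_n) = f_{X_1}(x_1)\, f_{X_2|x_1}(x_2) \cdots f_{X_n|x_1,\dots,x_{n-1}}(x_n) \qquad \text{(B.78)}$$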


Getting
information

Example B.17. James and Brett are playing a single shot in a dart game. The
stakes are: 30¥ to the contender who hits the first circle, 20¥ to the contender
who hits the second one, 10¥ to the contender who hits the third, 0¥ if no circle
is hit. The winner will get the money corresponding to the difference between the
players' scores. Let $X_1$ be the value of James' shot and $X_2$ be Brett's. Among
the friends, Charles organizes the bets on how much James will win or lose by
considering the random variable $Y = X_1 - X_2$, while Simon organizes bets on
the total amount of money that will pass between the contenders by considering
as random variable the absolute value of Y: |Y|. The sample space is $D_X = \{0, 10, 20, 30\}$
for both variables. Since James and Brett are sufficiently good players, to decide
the score of the bets the friends assume the probability of hitting one circle to
be the same as hitting another, in spite of the different areas of the circles (actually
the points are associated to the first circle and to the subsequent rings coming
from the difference of circles of radii increasing by an additive constant). Thus
they rely on a discrete uniform probability function of parameter N = 4 for
both $X_i$'s. The sample space of jointly observing $X_1$ and $X_2$ is composed of
all the ordered pairs $(x_1, x_2)$, as listed in the first column of Table B.2. The
second and third columns of the table report the corresponding values of the
pairs $(X_1, Y)$ and $(X_1, |Y|)$. It emerges that the cardinality of the sample space
underlying the joint variables $X_1$ and Y is the same as that of $(X_1, X_2)$, while the one
of the joint variables $X_1$ and |Y| is smaller, since the realizations (10, 0)
and (10, 20) of $(X_1, X_2)$ collapse into the same event (10, 10). Similarly, (20, 10) and (20, 30)
collapse into the same event (20, 10). The probabilities of these events, i.e. the
values of the joint probability functions of these variables, are reported in Table
B.3. Namely, since all specifications of $(X_1, X_2)$ are equiprobable and are 16 in
number, $f_{X_1,X_2}(x_1,x_2) = \frac{1}{16}$, $\forall x_1, x_2$. The same happens for $(X_1, Y)$,
thus $f_{X_1,Y}(x_1,y) = \frac{1}{16}$ $\forall x_1, y$.
On the contrary $(X_1,|Y|)$ is not equidistributed, since two points have
greater probability than the others; namely: $f_{X_1,|Y|}(x_1,|y|) = \frac{1}{16}$ for each
$(x_1,|y|) \in D_X \times D_{|Y|} - \{(10,10),(20,10)\}$; and $f_{X_1,|Y|}(x_1,|y|) = \frac{1}{8}$ for each
$(x_1,|y|) \in \{(10,10),(20,10)\}$. To compute the related cumulative distribution
function it is sufficient to sum all the probabilities of the points having both
coordinates less than or equal to the requested ones, as formally pointed out in
(B.64). The shaded rectangles in Table B.3 suggest how the c.d.f. is computed.
Namely, hiding the pairs with null probability, we have:




Table B.2: Sample spaces of (X1, X2), (X1,Y) and (X1,|Y|) in Example B.17. 


(XL Y) | (X1, |Y) 


Table B.3: Joint probability function of (X1, X2), (X1, Y) and (X1, |Y |). Gray areas 
highlight the probabilities summing up in (B.79) and (B.80). 


Xə 
10 20 30 | -30 -20 —10 10 20 30 10 20 30 


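$$F_{X_1,Y}(10,20) = \sum_{i_1=1}^{2} \sum_{i_2=1}^{6} f_{X_1,Y}(x_{i_1}, y_{i_2}) =$$
$$f_{X_1,Y}(0,-30) + f_{X_1,Y}(0,-20) + f_{X_1,Y}(0,-10) + f_{X_1,Y}(0,0) +$$
$$f_{X_1,Y}(10,-20) + f_{X_1,Y}(10,-10) + f_{X_1,Y}(10,0) + f_{X_1,Y}(10,10) = \frac{1}{2} \qquad \text{(B.79)}$$


$$F_{X_1,|Y|}(10,20) = \sum_{i_1=1}^{2} \sum_{i_2=1}^{3} f_{X_1,|Y|}(x_{i_1}, |y|_{i_2}) = \frac{7}{16} \qquad \text{(B.80)}$$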
To decide the score of the bets Charles needs the marginal probability function
of $(X_1, Y)$ with respect to Y. For example, according to (B.67) he can compute
$f_Y(-30)$ as:


Spanning the 
expectation on 
many variables 




$$f_Y(-30) = f_{X_1,Y}(0,-30) + f_{X_1,Y}(10,-30) + f_{X_1,Y}(20,-30) + f_{X_1,Y}(30,-30) = \frac{1}{16} \qquad \text{(B.81)}$$


This amounts to summing the probabilities stored in the 5-th column of Table
B.3 (corresponding to Y = -30). Repeating the same operation on the other
columns, he obtains the full description of $f_Y$:


$$f_Y(-20) = \frac{2}{16} \qquad f_Y(-10) = \frac{3}{16} \qquad f_Y(0) = \frac{4}{16}$$
$$f_Y(10) = \frac{3}{16} \qquad f_Y(20) = \frac{2}{16} \qquad f_Y(30) = \frac{1}{16} \qquad \text{(B.82)}$$

Similarly, Simon discovers the probability of any possible money transfer:

$$f_{|Y|}(0) = \frac{4}{16} \qquad f_{|Y|}(10) = \frac{6}{16} \qquad f_{|Y|}(20) = \frac{4}{16} \qquad f_{|Y|}(30) = \frac{2}{16} \qquad \text{(B.83)}$$


Richard is sitting in an unfavorable spot, thus he cannot see the target. However,
he can hear the values of Y claimed by Charles and of |Y| claimed by Simon. He must
choose on which of the two results to base his guess about James' shot (i.e. about the
$X_1$ specification). One way is to consider the conditional probability function of
$X_1$ given specifications of Y or |Y|. For instance, for Y = 10 he has, according
to (B.71)


$$f_{X_1|Y=10}(0) = \frac{0}{3/16} = 0 \qquad f_{X_1|Y=10}(10) = \frac{1/16}{3/16} = \frac{1}{3}$$
$$f_{X_1|Y=10}(20) = \frac{1/16}{3/16} = \frac{1}{3} \qquad f_{X_1|Y=10}(30) = \frac{1/16}{3/16} = \frac{1}{3}$$

and for |Y| = 10 he has

$$f_{X_1||Y|=10}(0) = \frac{1/16}{6/16} = \frac{1}{6} \qquad f_{X_1||Y|=10}(10) = \frac{2/16}{6/16} = \frac{1}{3}$$
$$f_{X_1||Y|=10}(20) = \frac{2/16}{6/16} = \frac{1}{3} \qquad f_{X_1||Y|=10}(30) = \frac{1/16}{6/16} = \frac{1}{6}$$


Thus he appreciates that on this data item the information coming from the 
first claimer is more interesting for discovering James’ inability to hit the target. 
Similar considerations on the other possible data will convince Richard to pay 
attention to Charles’ data rather than to Simon’s. 
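
The marginal and conditional probabilities of this example can be obtained by plain enumeration of the 16 equiprobable pairs. The sketch below is ours: `count_pairs` and `conditional` are hypothetical helper names, and a negative first argument is used as a convention meaning "any value of $X_1$".

```c
#include <stdlib.h>

#define N_SCORES 4
static const int scores[N_SCORES] = {0, 10, 20, 30};

/* Counts the equiprobable pairs (x1, x2) with X1 == x1 (any X1 if x1 < 0)
   and Y == y, where Y is |X1 - X2| if use_abs is nonzero and X1 - X2
   otherwise. Dividing the count by 16 gives the probability. */
int count_pairs(int x1, int y, int use_abs) {
    int count = 0;
    for (int i = 0; i < N_SCORES; i++)
        for (int j = 0; j < N_SCORES; j++) {
            int diff = scores[i] - scores[j];
            if (use_abs)
                diff = abs(diff);
            if ((x1 < 0 || scores[i] == x1) && diff == y)
                count++;
        }
    return count;
}

/* Conditional probability of X1 = x1 given Y = y (or |Y| = y), as in (B.71):
   joint count over marginal count. The conditioning value must be attainable. */
double conditional(int x1, int y, int use_abs) {
    return (double)count_pairs(x1, y, use_abs) / count_pairs(-1, y, use_abs);
}
```

For instance `count_pairs(-1, 10, 1)` returns the numerator of $f_{|Y|}(10)$ in (B.83), and `conditional(0, 10, 0)` is null because James cannot win 10 after scoring 0.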


E 
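Definition B.6 extends to multi-dimensional variables as follows:


Definition B.16. The expected value of a function g of X is defined by:


$$E[g(X)] = \sum_{i_1=1}^{\nu_1} \cdots \sum_{i_n=1}^{\nu_n} g(x_{i_1},\dots,x_{i_n})\, f_{X_1,\dots,X_n}(x_{i_1},\dots,x_{i_n}) \qquad \text{(B.84)}$$


In particular:


$$E[X_j] = \sum_{i_1=1}^{\nu_1} \cdots \sum_{i_n=1}^{\nu_n} x_{i_j}\, f_{X_1,\dots,X_n}(x_{i_1},\dots,x_{i_n}) \qquad \text{(B.85)}$$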


Very often it is useful to compute (B.84) by factorizing the joint probability
function as in (1.60). Thus:


$$E[g(X_1,\dots,X_n)] = E_{X_1}[E_{X_2}[\dots E_{X_{n-1}}[E_{X_n}[g(X_1,\dots,X_n)|X_1,\dots,X_{n-1}]|X_1,\dots,X_{n-2}]\dots|X_1]] \qquad \text{(B.86)}$$


where $E_{X_j}[\,\cdot\,|X_1,\dots,X_{j-1}]$ denotes a random variable having $X_1,\dots,X_{j-1}$ in
argument as a result of an expectation operator on the random variable $X_j$.
Check: Let's consider, for simplicity's sake, $E[g(X_1,X_2)]$. We may compute
this expectation as follows:


$$E[g(X_1,X_2)] = \sum_{i_1=1}^{\nu_1} \sum_{i_2=1}^{\nu_2} g(x_{i_1},x_{i_2})\, f_{X_1}(x_{i_1})\, f_{X_2|X_1=x_{i_1}}(x_{i_2}) =$$
$$\sum_{i_1=1}^{\nu_1} E_{X_2}[g(x_{i_1},X_2)|X_1 = x_{i_1}]\, f_{X_1}(x_{i_1}) = E_{X_1}[E_{X_2}[g(X_1,X_2)|X_1]] \qquad \text{(B.87)}$$
i=1 
Example B.18. Consider the experiment of drawing n times from an urn
containing N balls, K of which are red, without putting the drawn balls back in
the urn. The number X of red balls drawn follows a hypergeometric distribution
and can be expressed as the sum $\sum_{i=1}^{n} X_i$, where $X_i$ is the Bernoulli random
variable expressing the result of the i-th extraction. Clearly, these variables
are not independent, as at each extraction the urn composition changes. It is
easy to derive $E[X_1] = \frac{K}{N}$, while $E[X_2]$ can be computed through application
of (B.86):


$$E[X_2] = E_{X_1}[E_{X_2}[X_2|X_1]]$$
$$= E_{X_2}[X_2|X_1 = 0]\,P(X_1 = 0) + E_{X_2}[X_2|X_1 = 1]\,P(X_1 = 1)$$
$$= \frac{K}{N-1}\left(1 - \frac{K}{N}\right) + \frac{K-1}{N-1}\,\frac{K}{N} = \frac{K}{N} \qquad \text{(B.88)}$$


We can show that $E[X_i] = \frac{K}{N}$ for the remaining variables, too. Therefore the
mean of X will be


$$E[X] = \sum_{i=1}^{n} E[X_i] = n\frac{K}{N} \qquad \text{(B.89)}$$


thus coinciding with the analogous quantity arising from a binomial experiment,
in which the drawn balls are put back in the urn. The corresponding variances
are different, being equal to $n\frac{K}{N}\left(1 - \frac{K}{N}\right)\frac{N-n}{N-1}$ in the former case, and to the
greater value $n\frac{K}{N}\left(1 - \frac{K}{N}\right)$ in the latter.
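
The progressive expectation (B.88) and the two variances can be checked numerically for any urn. This is a sketch with function names of our own choosing; the hypergeometric variance is written with the usual finite-population factor (N - n)/(N - 1).

```c
/* E[X2] by progressive expectation, as in (B.88). */
double second_draw_mean(int N, int K) {
    double p_red_first = (double)K / N;                  /* P(X1 = 1) */
    double mean_given_not_red = (double)K / (N - 1);     /* E[X2 | X1 = 0] */
    double mean_given_red = (double)(K - 1) / (N - 1);   /* E[X2 | X1 = 1] */
    return mean_given_not_red * (1.0 - p_red_first)
         + mean_given_red * p_red_first;
}

/* Variance of the number of red balls in n draws without replacement. */
double hypergeometric_variance(int n, int N, int K) {
    double p = (double)K / N;
    return n * p * (1.0 - p) * (N - n) / (N - 1);
}

/* Variance of the number of red balls in n draws with replacement. */
double binomial_variance(int n, int N, int K) {
    double p = (double)K / N;
    return n * p * (1.0 - p);
}
```

With N = 10 and K = 4, `second_draw_mean` returns K/N = 0.4 up to rounding, and for n = 3 the hypergeometric variance is smaller than the binomial one, as stated above.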


Progressive 
expectation 


Drawing a set of 
non independent 
elements 


Mixed moments 


Covariance 


Correlation 


An elementary 
moments’ algebra 




In addition to the vector of means $E[X] = (E[X_1],\dots,E[X_n])$ and the vector
of variances $V[X] = (V[X_1],\dots,V[X_n])$, a new set of mixed moments, i.e.
expected values of products of different variables, is of general interest. Their
second order elements are defined as follows:


Definition B.17. Let $X_1$, $X_2$ be two joint random variables. Their covariance
is defined as:


$$\mathrm{Cov}[X_1,X_2] = \sigma_{X_1,X_2} = E[(X_1 - \mu_{X_1})(X_2 - \mu_{X_2})] \qquad \text{(B.90)}$$
Oo 


An equivalent definition, often preferable for computational reasons, is the
following

$$\sigma_{X_1,X_2} = E[X_1 X_2] - \mu_{X_1}\mu_{X_2} \qquad \text{(B.91)}$$


The common normalized form of this parameter is represented by the correlation
coefficient, whose expression is:


Definition B.18. The correlation coefficient between the joint random variables
$X_1$, $X_2$ is:


$$\rho_{X_1,X_2} = \frac{\sigma_{X_1,X_2}}{\sigma_{X_1}\sigma_{X_2}} \qquad \text{(B.92)}$$


under the obvious condition that neither of the two standard deviations is null.
| 


These two quantities account for the linear relationship between two random
variables. In particular the correlation coefficient is dimensionless and takes
values inside the interval [-1, 1]. Two random variables $X_1$ and $X_2$ having null
covariance are said to be uncorrelated. On the contrary, if $X_1 = aX_2$ their
correlation coefficient is either +1 or -1 depending on whether a is positive or
negative.


Fact B.1. The following statements follow directly from the above definitions:


• If $X_1$ and $X_2$ are independent, then $g_1(X_1)$ and $g_2(X_2)$ are independent
too;


• If $X_1$ and $X_2$ are independent, then:

$$E[g_1(X_1)\,g_2(X_2)] = E[g_1(X_1)]\,E[g_2(X_2)] \qquad \text{(B.93)}$$

• If $X_1$ and $X_2$ are independent, then:

$$\mathrm{Cov}[X_1,X_2] = 0 \qquad \text{(B.94)}$$

• For every pair $X_1$, $X_2$:

$$E[X_1 + X_2] = E[X_1] + E[X_2] \qquad \text{(B.95)}$$

• If $X_1$ and $X_2$ are independent, or even if they are just uncorrelated, then:

$$V[X_1 + X_2] = V[X_1] + V[X_2] \qquad \text{(B.96)}$$

• In any case:

$$V[X_1 + X_2] = V[X_1] + V[X_2] + 2\,\mathrm{Cov}[X_1,X_2] \qquad \text{(B.97)}$$
$$\mathrm{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j] \qquad \text{(B.98)}$$


Example B.19. Consider again the game of James and Brett.

To compute the expected amounts E[Y] and E[|Y|] of the quantities considered
by the bookmakers we may either refer directly to the distribution laws reported
in (B.82-B.83) or consider the dependence of Y and |Y| on the original variables
$X_1$ and $X_2$. For instance we compute


$$E[Y] = \sum_{i=1}^{4}\sum_{j=1}^{4} (x_{1i} - x_{2j})\, f_{X_1,X_2}(x_{1i},x_{2j}) = 0 \qquad \text{(B.99)}$$
$$E[|Y|] = \sum_{i=1}^{4} |y|_i\, f_{|Y|}(|y|_i) = \frac{25}{2} \qquad \text{(B.100)}$$


To give me odds Simon waits to receive my stake after James' throw. But I
suspect that this is just a trick, prompted by the considerations at the end of
Example B.17, since I guess that the next outcome of |Y| is independent of
$X_1$'s. Indeed I compute the correlation between the two variables as:


$$\rho_{X_1,|Y|} = \frac{\mathrm{Cov}[X_1,|Y|]}{\sigma_{X_1}\sigma_{|Y|}} \qquad \text{(B.101)}$$
where (B.102) 
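$$\mathrm{Cov}[X_1,|Y|] = E[X_1|Y|] - E[X_1]\,E[|Y|] \qquad \text{(B.103)}$$
$$E[X_1|Y|] = \sum_{i=1}^{4}\sum_{j=1}^{4} x_{1i}\,|y|_j\, f_{X_1,|Y|}(x_{1i},|y|_j) \qquad \text{(B.104)}$$

which makes

$$E[X_1|Y|] = \frac{375}{2} \qquad \text{(B.105)}$$
$$\mathrm{Cov}[X_1,|Y|] = 0 \qquad \text{(B.106)}$$
$$\rho_{X_1,|Y|} = 0 \qquad \text{(B.107)}$$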


Therefore I conclude that the two variables are linearly independent. Simon 
agrees with me, but also reminds me that linear independence does not make 


Uncorrelation 
does not make 
independence 




necessarily for the independence of the two variables. For instance, from Exam- 
ple B.17 we have : 


$$f_{X_1||Y|=10}(10) = \frac{1}{3}$$

that is different from the unconditioned probability:

$$f_{X_1}(10) = \frac{1}{4} \qquad \text{(B.108)}$$
a 
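
The null covariance between $X_1$ and |Y| can be verified by enumerating the 16 equiprobable pairs of shots. The function below is a sketch of ours, not part of the original listings.

```c
#include <stdlib.h>

/* Covariance between X1 and |Y| = |X1 - X2| for two independent shots,
   each uniform on {0, 10, 20, 30}: E[X1 |Y|] - E[X1] E[|Y|]. */
double cov_x1_abs_y(void) {
    const int scores[4] = {0, 10, 20, 30};
    double e_x1 = 0.0, e_abs = 0.0, e_prod = 0.0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int a = abs(scores[i] - scores[j]);
            e_x1 += scores[i] / 16.0;        /* each pair has probability 1/16 */
            e_abs += a / 16.0;
            e_prod += scores[i] * a / 16.0;
        }
    return e_prod - e_x1 * e_abs;
}
```

The function returns 0, although the conditional probabilities computed in Example B.17 differ from the marginal ones: a null covariance only rules out a linear relationship.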


The multinomial

Example B.20. A binomial distribution models the execution of a set of n
independent experiments, each classifying the results in two classes ("success"
and "failure"). The multinomial distribution can be viewed as an extension of
the binomial one, where now each of the n experiments can have k different
results. More precisely, the result is $i \in \{1,\dots,k\}$ with probability $p_i \in [0,1]$,
and $\sum_{i=1}^{k} p_i = 1$. We are interested in knowing how many experiments resulted
in each of the possible outcomes. Thus we are interested in a vector $v =
(v_1,\dots,v_k)$ such that $\sum_{i=1}^{k} v_i = n$; the related probability function is


$$P(V_1 = v_1,\dots,V_k = v_k) = \binom{n}{v_1,\dots,v_k} \prod_{i=1}^{k} p_i^{v_i}\; I_{\{n\}}\left(\sum_{i=1}^{k} v_i\right) \qquad \text{(B.109)}$$

where $\binom{n}{v_1,\dots,v_k} = \frac{n!}{v_1!\cdots v_k!}$ is the multinomial coefficient expressing in how many
ways n objects can be divided into k subsets of cardinality $v_1,\dots,v_k$,
respectively.

If we define a set of nk Bernoulli random variables $\{Z_i^l,\ i \in \{1,\dots,n\},\ l \in
\{1,\dots,k\}\}$, where $Z_i^l$ assumes the value 1 if the i-th experiment has outcome l
and 0 otherwise, given $s, t \in \{1,\dots,k\}$ with $s \ne t$ we have

$$V_s = \sum_{i=1}^{n} Z_i^s \qquad \text{(B.110)}$$

$$V_t = \sum_{j=1}^{n} Z_j^t \qquad \text{(B.111)}$$

$$\mathrm{Cov}[V_s,V_t] = \mathrm{Cov}\left[\sum_{i=1}^{n} Z_i^s, \sum_{j=1}^{n} Z_j^t\right] = \sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{Cov}[Z_i^s,Z_j^t] \qquad \text{(B.112)}$$

As the experiments are independent, $Z_i^s$ and $Z_j^t$ are independent for every s and
t whenever $i \ne j$. Moreover, the products $Z_i^s Z_i^t$ equal 0 for each i when $s \ne t$.
Thus


$$\mathrm{Cov}[V_s,V_t] = \sum_{i=1}^{n} \mathrm{Cov}[Z_i^s,Z_i^t] = \sum_{i=1}^{n} \left(E[Z_i^s Z_i^t] - E[Z_i^s]\,E[Z_i^t]\right) = -\sum_{i=1}^{n} E[Z_i^s]\,E[Z_i^t] = -n p_s p_t \qquad \text{(B.113)}$$
| 
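
The result (B.113) can be checked by brute force on a small case, enumerating all $k^n$ sequences of outcomes. This is only a sketch: the function name and the zero-based outcome labels are ours, and the enumeration is feasible only for small n and k.

```c
/* Covariance between the counts V_s and V_t of a multinomial experiment,
   computed by enumerating all k^n outcome sequences of n independent
   trials; p[0..k-1] are the outcome probabilities, s and t are zero-based. */
double multinomial_cov(int n, int k, const double *p, int s, int t) {
    long total = 1;
    for (int i = 0; i < n; i++)
        total *= k;
    double e_s = 0.0, e_t = 0.0, e_st = 0.0;
    for (long code = 0; code < total; code++) {
        long c = code;
        double prob = 1.0;
        int vs = 0, vt = 0;
        for (int i = 0; i < n; i++) {    /* decode the outcome of trial i */
            int outcome = (int)(c % k);
            c /= k;
            prob *= p[outcome];
            if (outcome == s) vs++;
            if (outcome == t) vt++;
        }
        e_s += vs * prob;
        e_t += vt * prob;
        e_st += vs * vt * prob;
    }
    return e_st - e_s * e_t;
}
```

With n = 4 and probabilities {0.2, 0.3, 0.5}, the covariance between the first two counts equals $-4 \cdot 0.2 \cdot 0.3$ up to rounding, in agreement with (B.113).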


B.4.2 The continuous case 


Here we provide the continuous version of the previous section formulas that 
involve probability functions. Although maintaining the same symbol fx, the 
non cumulative description of a continuous variable distribution law has the 
meaning of a probability density as explained in Sec. 1.3.1. 


Definition B.19. The joint density function$^{8}$ of a continuous random variable $(X_1,\dots,X_n)$ is:

$$f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \frac{\partial^n F_{X_1,\dots,X_n}(x_1,\dots,x_n)}{\partial x_1\cdots\partial x_n} \tag{B.114}$$


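We compute from it the joint c.d.f. through:

$$F_{X_1,\dots,X_n}(x_1,\dots,x_n) = \int_{-\infty}^{x_1}\cdots\int_{-\infty}^{x_n} f_{X_1,\dots,X_n}(\xi_1,\dots,\xi_n)\,d\xi_1\cdots d\xi_n \tag{B.115}$$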
More in general we pass from the discrete to the continuous version of a formula by substituting integrals for sums. Therefore we isolate the marginal c.d.f. $F_{X_j}$ of a single component by integrating the contribution of the other components:

$$F_{X_j}(x_j) = \int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{x_j}\cdots\int_{-\infty}^{+\infty} f_{X_1,\dots,X_n}(\xi_1,\dots,\xi_j,\dots,\xi_n)\,d\xi_1\cdots d\xi_j\cdots d\xi_n \tag{B.116}$$


Similarly, we obtain that $f_{X_j}$ as a marginal density function of an n-dimensional continuous random variable $(X_1,\dots,X_n)$ with respect to $X_j$ is computed by:

$$f_{X_j}(x_j) = \int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} f_{X_1,\dots,X_n}(x_1,\dots,x_n)\,dx_1\cdots dx_{j-1}\,dx_{j+1}\cdots dx_n \tag{B.117}$$
Moving to the conditional distributions, we have that

$$f_{X_1|X_2=x_2}(x_1) = \frac{\partial^2 F_{X_1,X_2}(x_1,x_2)/\partial x_1\partial x_2}{f_{X_2}(x_2)} = \frac{f_{X_1,X_2}(x_1,x_2)}{f_{X_2}(x_2)} \tag{B.118}$$


8 With the mentioned mathematical caveats.


Moving from 
sums to integrals 


d.f. is a 
derivative 


c.d.f. is an 
integral 


conditional 
functions may be 
jointly both. 


Splitting the 
expectation 


The Gauss bell 


Full symmetry 
hypothesis 




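and$^{9}$

$$F_{X_1|X_2=x_2}(x_1) = \frac{\partial F_{X_1,X_2}(x_1,x_2)/\partial x_2}{\partial F_{X_2}(x_2)/\partial x_2} = \frac{\int_{-\infty}^{x_1} f_{X_1,X_2}(\xi,x_2)\,d\xi}{f_{X_2}(x_2)} \tag{B.119}$$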
Finally, the expectation of $g(X_1,\dots,X_n)$ is computed by:

$$E[g(X_1,\dots,X_n)] = \int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} g(x_1,\dots,x_n)\,f_{X_1,\dots,X_n}(x_1,\dots,x_n)\,dx_1\cdots dx_n \tag{B.120}$$


Example B.21. Let U be a uniform random variable in [0, 1]. Consider a random 
variable X distributed according to a Bernoulli distribution whose parameter 
is in turn distributed according to U. According to (B.86) we can compute the 
mean of X as 


$$E[X] = E_U[E_X[X|U]] = E_U[U] = \int_0^1 u\,du = \frac{1}{2} \tag{B.121}$$


Example B.22. Consider the bidimensional extension of a Gaussian distribution 
by requiring a random vector (X, Y) to have probability density f symmetrically 
distributed around a point in the Euclidean plane. If we assume that this point 
coincides with the axes origin, and further require that the expression of f does 
not change for a rotation of the coordinate axes, then the above property reads 


$$f_{X,Y}(x,y) = g(x^2+y^2) \tag{B.122}$$

If we add the requirement of independence between X and Y, it follows easily that any exponential function g satisfies (B.122). We can obtain the analytic form for f requiring that it describes a probability density, i.e. its integral over $\mathbb{R}^2$ is 1:

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y}\,e^{-\frac{1}{2}\left(\left(\frac{x}{\sigma_X}\right)^2+\left(\frac{y}{\sigma_Y}\right)^2\right)} \tag{B.123}$$

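where $\sigma_X$ and $\sigma_Y$ compensate scale changes in the coordinates. We can verify easily how using as symmetry center a point $(\mu_X,\mu_Y)$ different from the axes origin turns into the following form for the joint d.f.

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y}\,e^{-\frac{1}{2}\left(\left(\frac{x-\mu_X}{\sigma_X}\right)^2+\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right)} \tag{B.124}$$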


Figures B.10 and B.11 show the typical bell shapes of this d.f. 


Fig. B.10: Shape of the d.f. of a bidimensional Gaussian distribution with independent components of parameters $\mu_X = \mu_Y = 5$ and $\sigma_X = \sigma_Y = 3$; (a) and (b) show the plot and contours of $f_{X,Y}$, respectively.


Fig. B.11: Shape of the d.f. of a bidimensional Gaussian distribution with independent components of parameters $\mu_X = \mu_Y = 5$, $\sigma_X = 3$ and $\sigma_Y = 4$; same notation as in Fig. B.10.


Squeeze the bell 
if the 
components are 
correlated 


The marginal 
distribution is still 
Gaussian 


Independence is 
just a matter of 
perspective 


Regression line 




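Note how (B.124) can be expressed as

$$f_{X,Y}(x,y) = \frac{1}{(\sqrt{2\pi})^2\sqrt{|\Sigma|}}\,e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu})} \tag{B.125}$$

where $\boldsymbol{x} = (x,y)$, $\boldsymbol{\mu} = (\mu_X,\mu_Y)$ and

$$\Sigma = \begin{pmatrix} V[X] & \mathrm{Cov}[X,Y] \\ \mathrm{Cov}[X,Y] & V[Y] \end{pmatrix} = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix} \tag{B.126}$$

with Cov[X,Y] = 0, where $\Sigma$ is called covariance matrix of the distribution.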
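Expression (B.125) generalizes in two ways. For $\mathrm{Cov}[X,Y] \neq 0$, hence $\rho_{X,Y} \neq 0$, (B.124) moves to

$$f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_{X,Y}^2}}\,e^{-\frac{1}{2(1-\rho_{X,Y}^2)}\left(\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho_{X,Y}\frac{x-\mu_X}{\sigma_X}\frac{y-\mu_Y}{\sigma_Y} + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right)} \tag{B.127}$$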


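Moving from 2 to n components for $\boldsymbol{X}$, and to analogous n components for $\boldsymbol{\mu}$ and $n^2$ components for $\Sigma$, we have

$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{(\sqrt{2\pi})^n\sqrt{|\Sigma|}}\,e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu})} \tag{B.128}$$

where $|\Sigma|$ is the determinant [Smirnov, 1961] of $\Sigma$. Coming back to two dimensions for expository reasons, the marginal distribution, say of X, is

$$f_X(x) = \int_{-\infty}^{+\infty} f_{X,Y}(x,y)\,dy = \frac{1}{\sqrt{2\pi}\,\sigma_X}\,e^{-\frac{1}{2}\left(\frac{x-\mu_X}{\sigma_X}\right)^2} \tag{B.129}$$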

i.e. that of a (unidimensional) Gaussian distribution of parameters $\mu_X$ and $\sigma_X$. Introducing a non null covariance between the components of the random vector has the effect of rotating the d.f. contour lines. At that point they become ellipses (the so-called concentration ellipses) whose axes are no longer parallel to the cartesian axes, as illustrated in Fig. B.12. A stricter analysis would show that the eigenvectors of $\Sigma$ identify the axes of the d.f. contour lines and that the related eigenvalues are proportional to the variance around these axes [Morrison, 1967]. Thus a specific rotation of these axes turns (X,Y) into a new vector whose components are independent. Finally, the conditional density function of one component given the value of the remaining one reads


$$f_{Y|X=x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma'}\,e^{-\frac{1}{2}\left(\frac{y-\mu'}{\sigma'}\right)^2} \tag{B.130}$$


where 


9 We may compute $F_{X_1|X_2=x_2}(x_1)$ as the limit for $\Delta x_2 \to 0$ of $P(X_1 \le x_1,\ x_2 < X_2 \le x_2+\Delta x_2 \mid x_2 < X_2 \le x_2+\Delta x_2) = \frac{F_{X_1,X_2}(x_1,x_2+\Delta x_2)-F_{X_1,X_2}(x_1,x_2)}{F_{X_2}(x_2+\Delta x_2)-F_{X_2}(x_2)}$. By dividing the two terms of the fraction by $\Delta x_2$ you get the two derivatives in (B.119).


Fig. B.12: Shape of the d.f. of a bidimensional Gaussian distribution of parameters $\mu_X = \mu_Y = 5$, $\sigma_X = 3$, $\sigma_Y = 4$ and $\rho_{X,Y} = -0.5$; same notation as in Fig. B.10.
$$\mu' = \mu_Y + \rho_{X,Y}\frac{\sigma_Y}{\sigma_X}(x-\mu_X) \tag{B.131}$$

$$\sigma'^2 = \sigma_Y^2\,(1-\rho_{X,Y}^2) \tag{B.132}$$
Thus 


• the conditional expectation of Y given X = x lies on a line;


• conditioning Y with the knowledge of X always reduces the variance of Y, unless X and Y are independent;


• the variance of Y given X = x does not depend on x.
E 
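As a small numerical illustration of (B.131) and (B.132), the sketch below evaluates the conditional mean and variance for the parameters of Fig. B.12. The function name and the conditioning value x = 8 are our own illustrative choices.

```python
def conditional_gaussian(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Return (mean, variance) of Y given X = x, following (B.131) and (B.132)."""
    mean = mu_y + rho * sigma_y / sigma_x * (x - mu_x)
    variance = sigma_y ** 2 * (1 - rho ** 2)
    return mean, variance

# Parameters of Fig. B.12; x = 8 is an arbitrary conditioning value.
mean, variance = conditional_gaussian(8, 5, 5, 3, 4, -0.5)
print(mean, variance)

# The variance is the same whatever x is, and it is smaller than sigma_y**2 = 16.
_, variance_elsewhere = conditional_gaussian(-2, 5, 5, 3, 4, -0.5)
```

Moving x changes only the conditional mean, along the line (B.131); the conditional variance stays at $\sigma_Y^2(1-\rho_{X,Y}^2)$.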


Remark B.3. Equation (B.131) identifies a regression line between (X, Y) specifications. Namely, in the Kolmogorov framework $\mu'$ reads as the expected value of Y given an X specification, where the gap between Y and such expected value is a Gaussian variable of mean 0 and variance $\sigma'^2$ in (B.132). Point estimators
A and B in (4.104) and (4.105) represent the plug in estimators of uy and 
pxy ee 


B.5 Functions of random variables 


Take a function $g : \mathbb{R}^n \to \mathbb{R}$, substitute its arguments with random variables $X_1,\dots,X_n$, and interpret the whole as a generator of a new random variable Y assuming a specification $g(x_1,\dots,x_n)$ each time $X_1,\dots,X_n$ assume specifications $x_1,\dots,x_n$. The methods we show in this section aim to capture the distribution law of Y from the joint distribution law of $(X_1,\dots,X_n)$.


Reading the 
variables with a 
different code 


does not change 
the probability of 


the boiling down 
events. 


Y = 1/X


Y = X_1 X_2




B.5.1 From $F_{(X_1,\dots,X_n)}$ to $F_{g(X_1,\dots,X_n)}$


For $g : \mathbb{R}^n \to \mathbb{R}$ and $y \in \mathbb{R}$, let us denote by $g^{-1}(y)_k$ any point $(g_1^{-1}(y)_k,\dots,g_n^{-1}(y)_k) \in \mathbb{R}^n$ such that $g((g_1^{-1}(y)_k,\dots,g_n^{-1}(y)_k)) = y$, where subscript k, which we will hide henceforth, indexes the points corresponding to a same y. Let us also denote by $D_{g^{-1}}(a,b)$ the set $\bigcup\{g^{-1}(y) \mid y \in (a,b)\}$, i.e. the set of points in $\mathbb{R}^n$ that, fed as input to g, translate into a $y \in (a,b)$. According to this notation $Y < y$ is equivalent to $X \in D_{g^{-1}}(-\infty,y)$. Ergo:


Fact B.2. For any y, g, X as in the above notation

$$F_Y(y) = P(X \in D_{g^{-1}}(-\infty,y)) \tag{B.133}$$


Example B.23. Let X be a continuous uniform random variable in the interval [1,2]; compute the density function of the random variable $Y = \frac{1}{X}$.

For this variable $f_X(x) = I_{[1,2]}(x)$ and $F_X(x) = (x-1)I_{[1,2]}(x) + I_{(2,+\infty)}(x)$. We first derive the cumulative distribution function:

$$P(Y \le y) = P\left(\frac{1}{X} \le y\right) = P\left(X \ge \frac{1}{y}\right) = 1 - P\left(X < \frac{1}{y}\right) = 1 - F_X\left(\frac{1}{y}\right) = 2 - \frac{1}{y} \tag{B.134}$$

The interval where it holds is given by the condition $1 \le \frac{1}{y} \le 2$, which translates into $\frac{1}{2} \le y \le 1$. Thus we obtain the following cumulative distribution function

$$F_Y(y) = \left(2-\frac{1}{y}\right) I_{[\frac{1}{2},1]}(y) + I_{(1,+\infty)}(y) \tag{B.135}$$

and, differentiating it, the density function

$$f_Y(y) = \frac{1}{y^2}\,I_{[\frac{1}{2},1]}(y) \tag{B.136}$$
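The c.d.f. in (B.135) can be compared with the empirical c.d.f. of simulated values of Y = 1/X. This is only a sketch: the sample size, the seed and the evaluation points are arbitrary choices of ours.

```python
import random

def cdf_y(y):
    """C.d.f. of Y = 1/X as in (B.135), for X uniform in [1, 2]."""
    if y < 0.5:
        return 0.0
    if y > 1.0:
        return 1.0
    return 2.0 - 1.0 / y

rng = random.Random(0)
sample = [1.0 / rng.uniform(1.0, 2.0) for _ in range(100000)]

def empirical_cdf(y):
    """Fraction of simulated values not exceeding y."""
    return sum(v <= y for v in sample) / len(sample)

# Largest gap between the empirical and the analytic c.d.f. on a few points
max_gap = max(abs(empirical_cdf(y) - cdf_y(y)) for y in (0.55, 0.6, 0.7, 0.8, 0.9))
print(max_gap)
```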


Example B.24. Let $U_1$ and $U_2$ be distributed according to a uniform distribution in [0,1]. Consider the random variable $X = U_1U_2$, for which we can derive the following c.d.f.

$$F_X(x) = \iint_{u_1u_2 \le x} f_{U_1,U_2}(u_1,u_2)\,du_1du_2 = \int_0^1\int_0^{x/u_1} f_{U_1,U_2}(u_1,u_2)\,du_2\,du_1 \tag{B.137}$$

Applying the substitution $v = u_1u_2$ and exchanging the two integrals, we get

$$F_X(x) = \int_0^x\int_0^1 \frac{1}{u_1}\,f_{U_1,U_2}\left(u_1,\frac{v}{u_1}\right)du_1\,dv \tag{B.138}$$

so that the density of X, for independent $U_1$ and $U_2$, is

$$f_X(x) = \int_0^1 \frac{1}{u_1}\,f_{U_2}\left(\frac{x}{u_1}\right)du_1 = \int_0^1 \frac{1}{u_1}\,I_{[0,1]}\left(\frac{x}{u_1}\right)du_1 = \int_0^1 \frac{1}{u_1}\,I_{(0,1)}(x)\,I_{(x,1)}(u_1)\,du_1 = I_{(0,1)}(x)\int_x^1 \frac{1}{u_1}\,du_1 = -\ln x\; I_{(0,1)}(x) \tag{B.139}$$
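Integrating the density $-\ln x$ over (0, x] gives the c.d.f. $x - x\ln x$ for $0 < x < 1$, which lends itself to a quick numerical check against simulated products of two independent uniform variables. The sketch below uses an arbitrary sample size, seed and set of evaluation points.

```python
import math
import random

def cdf_product(x):
    """C.d.f. of U1*U2, obtained by integrating the density -ln(x) over (0, x]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x - x * math.log(x)

rng = random.Random(1)
sample = [rng.random() * rng.random() for _ in range(100000)]

# Largest gap between the empirical and the analytic c.d.f. on a few points
gap = max(abs(sum(v <= x for v in sample) / len(sample) - cdf_product(x))
          for x in (0.1, 0.3, 0.5, 0.7, 0.9))
print(gap)
```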


Fact B.3. If X follows a Gaussian distribution of parameters $\mu$ and $\sigma$, $Z = \frac{X-\mu}{\sigma}$ follows a Normal distribution. Namely

$$F_X(x) = F_Z\left(\frac{x-\mu}{\sigma}\right) \tag{B.140}$$


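B.5.2 From $f_X$ to $f_{g(X)}$


Focusing on a monotone function Y = g(X) of a single random variable, differentiating the two members of the equation

$$\int_{-\infty}^{y} f_Y(\eta)\,d\eta = \int_{-\infty}^{g^{-1}(y)} f_X(\xi)\,d\xi \tag{B.141}$$

we obtain the quick rule:

$$f_Y(y) = f_X(g^{-1}(y))\,\left|g'(g^{-1}(y))\right|^{-1} \tag{B.142}$$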


where g' denotes the derivative of g. A mnemonic way of enunciating the equation is:

$$f_Y(y) = f_X(x)\left|\frac{\partial y}{\partial x}\right|^{-1} \tag{B.143}$$

In the general case (of non necessarily monotone function), (B.142) simply extends into


Fact B.4. With same notation of Fact B.2

$$f_Y(y) = \sum_{k=1}^{\#D_{g^{-1}}\{y\}} f_X(g^{-1}(y)_k)\,\left|g'(g^{-1}(y)_k)\right|^{-1} \tag{B.144}$$


Example B.25. From (B.142) we immediately derive that a linear mapping from X to Y: Y = aX + b introduces a scale change also in the d.f.s, namely:

$$f_Y(y) = f_X\left(\frac{y-b}{a}\right)\frac{1}{|a|} \tag{B.145}$$


Normal variable 
is a linear 
transform of a 
Gaussian variable 


The infinitesimal 
may be different 


if we pass from 
X to Y 


Y = X_1 + X_2


Working in a dual 
space 


may simplify 
problems 




Example B.26. With the same technique as in Examples B.23 and B.24, or directly from Fact B.4, we may prove that for any pair of continuous random variables $X_1$ and $X_2$ the d.f. of the variable $Y = X_1 + X_2$ is given by$^{10}$:

$$f_Y(y) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1,\,y-x_1)\,dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y-x_2,\,x_2)\,dx_2 \tag{B.146}$$


B.5.3 Using the moment generating function 


Definition B.20. Given a random variable X, its moment generating function $m_X$ is defined as:

$$m_X(t) = E[e^{tX}] \tag{B.147}$$

if it exists$^{11}$.


This function is a powerful tool for building and analyzing probabilistic models, 
as the reader can see in greater depth elsewhere [Cramér, 1958]. Here we just 
recall two properties of this function which will be employed for deducing the 
distribution of the sum of independent random variables. 


Fact B.5.


• If $X_1$ and $X_2$ are independent random variables then $m_{X_1+X_2}(t) = m_{X_1}(t)\,m_{X_2}(t)$.

• If $X_1$ and $X_2$ have the same moment generating function, then they follow the same distribution law as well.


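Example B.27. Let’s consider h independent random variables $X_1,\dots,X_h$, distributed according to a Gamma distribution of parameters $n \in \mathbb{R}^+$ and $\lambda \in \mathbb{R}^+$. Their moment generating function is

$$m_X(t) = \int_0^{+\infty} e^{tx}\,\frac{\lambda^n}{\Gamma(n)}\,x^{n-1}e^{-\lambda x}\,dx = \left(\frac{\lambda}{\lambda-t}\right)^n \int_0^{+\infty} \frac{(\lambda-t)^n}{\Gamma(n)}\,x^{n-1}e^{-(\lambda-t)x}\,dx = \left(\frac{\lambda}{\lambda-t}\right)^n \tag{B.148}$$

as the last integral amounts to 1, being computed on the d.f. of a Gamma distribution of parameters n and $\lambda - t$. From Fact B.5 we derive

$$m_{\sum_{i=1}^{h}X_i}(t) = \left(\frac{\lambda}{\lambda-t}\right)^{nh} \tag{B.149}$$

and therefore $\sum_{i=1}^{h}X_i$ follows a Gamma distribution of parameters nh and $\lambda$.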
oO 
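A quick simulation can make the conclusion of Example B.27 tangible. A Gamma variable of parameters nh and λ has mean nh/λ and variance nh/λ², so the sketch below compares these two values with their empirical counterparts for the sum. The parameter values, seed and sample size are arbitrary choices of ours; note that `random.gammavariate` expects a shape and a scale, the latter being 1/λ.

```python
import random

def sum_of_gammas(h, n, lam, rng):
    """Sum of h independent Gamma variables of parameters n (shape) and lam (rate)."""
    # random.gammavariate takes shape and scale; the scale equals 1/lam
    return sum(rng.gammavariate(n, 1.0 / lam) for _ in range(h))

rng = random.Random(0)
h, n, lam = 3, 2.0, 0.5  # illustrative values
draws = [sum_of_gammas(h, n, lam, rng) for _ in range(50000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)

# Moments of a Gamma distribution of parameters n*h and lam
expected_mean = n * h / lam
expected_var = n * h / lam ** 2
print(mean, expected_mean, var, expected_var)
```

Matching the first two moments does not prove the distributional claim, which rests on Fact B.5; it only offers a sanity check.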

10 In case of independent variables the first integral in (B.146) reads $\int_{-\infty}^{+\infty} f_{X_1}(x_1)\,f_{X_2}(y-x_1)\,dx_1$, having the typical format of a convolution integral $\int g(\xi)h(\tau-\xi)\,d\xi$.
11 i.e. if the expected value (B.147) is finite.




Fact B.6. Using the moment generating function we easily realize the following points that denote the involved distribution laws as reproducible:


• the sum of n independent Bernoulli random variables of parameter p is a binomial random variable of parameters n and p;

• the sum of n independent binomial random variables of parameters p and $k_i$ is a binomial random variable of parameters $\sum_{i=1}^{n}k_i$ and p;

• the sum of n independent Poisson random variables of parameter $\mu_i$ is a Poisson random variable of parameter $\sum_{i=1}^{n}\mu_i$;

• the sum of n independent negative exponential random variables of parameter $\lambda$ is a Gamma random variable of parameters $\lambda$ and n;

• the sum of n independent Gamma random variables of parameters $\lambda$ and $\nu_i$ is a Gamma random variable of parameters $\lambda$ and $\sum_{i=1}^{n}\nu_i$;

• the sum of n independent Gaussian random variables of parameters $\mu_i$ and $\sigma_i^2$ is a Gaussian random variable of parameters $\sum_{i=1}^{n}\mu_i$ and $\sum_{i=1}^{n}\sigma_i^2$;

• the sum of n independent Chi square random variables of parameter $\nu_i$ is a Chi square random variable of parameter $\sum_{i=1}^{n}\nu_i$.


The function $m_X(t)$ derives its name from the valuable property that:


Fact B.7.

$$E[X^k] = \left.\frac{\partial^k m_X(t)}{\partial t^k}\right|_{t=0} \tag{B.150}$$

where subscript t = 0 means that the derivative must be evaluated for this value of t.


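Example B.28. From (B.148) we have that the mean of a Gamma variable X with parameters $\lambda$ and n is

$$E[X] = \left.\frac{\partial}{\partial t}\left(\frac{\lambda}{\lambda-t}\right)^n\right|_{t=0} = \frac{n}{\lambda} \tag{B.151}$$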
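The property stated in Fact B.7 can also be checked numerically by differentiating the moment generating function (B.148) at t = 0 with a finite difference. This is just an illustrative sketch: the step size and the parameter values n = 3, λ = 2 are assumptions of ours.

```python
def gamma_mgf(t, n, lam):
    """Moment generating function (B.148) of a Gamma variable, valid for t < lam."""
    return (lam / (lam - t)) ** n

def derivative_at_zero(f, step=1e-5):
    """Central finite-difference approximation of f'(0)."""
    return (f(step) - f(-step)) / (2 * step)

n, lam = 3.0, 2.0  # illustrative parameters
mean = derivative_at_zero(lambda t: gamma_mgf(t, n, lam))
print(mean, n / lam)
```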


Remark B.4. From Fact B.5, moment generating functions univocally identify the random variables they refer to. Fact B.7 explains this property in terms of Fact 1.1, since if two variables have the same moment generating function, they have the same moments of any order as well.


specially for 
managing sums 
of variables. 


The mother of all 
moments 


Identify variables 
through a limited 


number of 
moments 


Rules for 
combining 
moments 


A suitable 
general purpose 
inequality 


Another suitable 
general purpose 
inequality 




Fact B.8. Given a sample $\{X_1,\dots,X_m\}$ of a Gaussian variable, the statistics $\sum_{i=1}^{m}X_i$ and $\sum_{i=1}^{m}(X_i-\overline{X})^2$ are independent random variables.


The sentence, that is based on the special features of the $e^{tX}$ expectation for X Gaussian, has two very suitable corollaries.


Corollary B.1. 
e Sample mean and sample variance of a Gaussian variable are independent, 


e sum and difference of two independent Gaussian variables are independent 
as well. 


B.5.4 Broad identification of transformed variables 


In spite of Fact 1.1, we may decide to bind our knowledge about X to a limited number of moments, for instance mean and variance. Thus we may assume that X and Y are broadly equally distributed just because they share these two moments. Otherwise we may decide to approximate $F_X$ with a c.d.f. having the same mean and variance, possibly in light of the central limit theorem, if applicable, etc. Here we recall some facts of life about the expectation of a function g of random variables.


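Fact B.9.


• $$E[X_1X_2] = E[X_1]E[X_2] + \mathrm{Cov}[X_1,X_2] \tag{B.152}$$


• If $X_1$ and $X_2$ are independent:

$$V[X_1X_2] = V[X_1]V[X_2] + E[X_1]^2V[X_2] + E[X_2]^2V[X_1] \tag{B.153}$$

$$E\left[\frac{X_1}{X_2}\right] = E[X_1]\,E\left[\frac{1}{X_2}\right] \tag{B.154}$$


• If $E[X_1^2] < +\infty$ and $E[X_2^2] < +\infty$ then$^{12}$

$$E[X_1X_2]^2 \le E[X_1^2]\,E[X_2^2] \quad \text{(Cauchy-Schwarz inequality)} \tag{B.155}$$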


Fact B.10. Given a Gaussian variable X of mean $\mu$ and variance $\sigma^2$, we get a Normal variable Z by the following relation

$$Z = \frac{X-\mu}{\sigma} \tag{B.156}$$

(see (B.8) and (B.15)).
Fact B.11. Given a random variable X, consider a generic function g on it: 


12 Replacing $X_1$ with $X_1-\mu_1$ and $X_2$ with $X_2-\mu_2$ we obtain $\rho_{X_1,X_2}^2 \le 1$, hence $-1 \le \rho_{X_1,X_2} \le 1$.




Fig. B.13: Exploiting (a) the convexity, or (b) concavity of a function g in order to 
broadly identify E[g(X)]. 


• if g is convex, $E[g(X)] \ge g(E[X])$. Indeed, the convexity of g implies that a linear function l must exist such that

$$g(E[X]) = l(E[X]) \tag{B.157}$$

$$g(x) \ge l(x) \tag{B.158}$$

where l(x) = a + bx for some $a,b \in \mathbb{R}$ (see Fig. B.13(a)). This implies

$$g(E[X]) = l(E[X]) = a + bE[X] = E[a+bX] = E[l(X)] \le E[g(X)] \tag{B.159}$$

This result is known as the Jensen inequality;


• if g is concave, $E[g(X)] \le g(E[X])$. This result is obtained as in the previous point, now exploiting the concavity of g (see Fig. B.13(b));


• if g is linear, $E[g(X)] = g(E[X])$. Indeed, as g(x) = mx + q for some $m,q \in \mathbb{R}$,

$$E[g(X)] = E[mX+q] = mE[X]+q = g(E[X]) \tag{B.160}$$


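Example B.29. From the above fact it follows that:


• $V[X] = E[X^2] - E[X]^2 \ge 0$ for every X, since $x^2$ is a convex function; $E[1/X] \ge 1/E[X]$ for every positive X, since 1/x is a convex function on $(0,+\infty)$ (the inequality reverses for negative X, where 1/x is concave).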
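The Jensen inequality of Fact B.11 can be observed on any sample, since it also holds for the empirical distribution. The sketch below uses a positive sample, where $x^2$ and 1/x are convex and $\sqrt{x}$ is concave; the uniform distribution in [1, 2], the seed and the choice of $\sqrt{x}$ as concave function are ours.

```python
import math
import random

rng = random.Random(0)
xs = [rng.uniform(1.0, 2.0) for _ in range(10000)]  # positive sample, arbitrary choice

def mean(values):
    return sum(values) / len(values)

m = mean(xs)
# Convex functions: E[g(X)] - g(E[X]) is nonnegative
convex_gap_square = mean([x * x for x in xs]) - m * m
convex_gap_inverse = mean([1.0 / x for x in xs]) - 1.0 / m
# Concave function: E[g(X)] - g(E[X]) is nonpositive
concave_gap_sqrt = mean([math.sqrt(x) for x in xs]) - math.sqrt(m)
print(convex_gap_square, convex_gap_inverse, concave_gap_sqrt)
```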


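Fact B.12. For every sample $\{X_1,\dots,X_m\}$, with $\overline{X} = \frac{1}{m}\sum_{i=1}^{m}X_i$ and $S^2 = \frac{1}{m-1}\sum_{i=1}^{m}(X_i-\overline{X})^2$:

$$E[\overline{X}] = E[X]; \qquad V[\overline{X}] = \frac{V[X]}{m} \tag{B.161}$$

$$E[S^2] = V[X] \tag{B.162}$$


The pivots of
classical
statistical
inference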
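A simulation can illustrate why $S^2$ divides by m − 1: averaged over many samples, it recovers V[X]. The sketch below draws samples from a uniform distribution in [0, 1] (so that E[X] = 1/2 and V[X] = 1/12); the sample size m = 5, the number of repetitions and the seed are arbitrary choices of ours.

```python
import random

def sample_mean(sample):
    return sum(sample) / len(sample)

def sample_variance(sample):
    """S^2 with the m - 1 denominator."""
    centre = sample_mean(sample)
    return sum((x - centre) ** 2 for x in sample) / (len(sample) - 1)

rng = random.Random(0)
m, repetitions = 5, 20000
means, variances = [], []
for _ in range(repetitions):
    sample = [rng.random() for _ in range(m)]  # X uniform in [0, 1]
    means.append(sample_mean(sample))
    variances.append(sample_variance(sample))

avg_mean = sample_mean(means)                                     # estimates E[X] = 1/2
var_of_mean = sum((v - avg_mean) ** 2 for v in means) / repetitions  # estimates V[X]/m = 1/60
avg_s2 = sample_mean(variances)                                   # estimates V[X] = 1/12
print(avg_mean, var_of_mean, avg_s2)
```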




B.6 Exercises 


1. A game uses a fair tetrahedral die whose faces are marked with numbers 1, 2, 3, 4. If X associates its specifications to each face of the rolling die, what are its expected value and variance? The die is rolled 20 times. What is the probability that we obtain the number 3 12 times?

2. A hexahedral die is rolled 600 times; let X be the random variable counting the number of throws getting 6. Show the probability distribution for this variable and then calculate the expected value, the variance and the mean square error.

3. An airplane has 4 engines, each having probability p = 0.10 of breaking during a fixed trip. Calculate the probability that the plane arrives safely at destination, knowing it may fly even with only one engine functioning properly.

4. Statistics from the National Health Service show that approximately 1 baby out of 100 is born with a specific metabolism defect. If, in a given day, in a standard hospital, 4 babies are born, calculate the probability that:

   (a) none has the defect;
   (b) at most 1 baby does;
   (c) at least one has it.

5. On average, 15% of the students in the Statistics course fail the written examination. There are 40 candidates for the summer session. Determine the probability that:

   (a) at least one fails the examination;
   (b) at most 39 students pass;
   (c) between 10 and 20 students fail.

6. In a charity lottery 1000 tickets are sold, 100 of which are winning. How many tickets have to be bought in order to find at least 1 winning ticket with a probability of 0.8?

7. In order to appreciate the ethical quality of the football teams, the federal committee performs the antidoping control on the official 11 players of 10 teams, obtaining the following sample of percentages of doped players: {9.09, 27.3, 9.09, 18.2, 0.0, 18.2, 27.2, 18.2, 0.0, 0.0, 9.09}. Let p be the parameter of the Bernoulli variable Z that is equal to 1 if a player is doped and 0 otherwise. Provide on the basis of the above sample:

   (a) A sufficient statistic for P.
   (b) A 0.900 two-sided confidence interval for P.


   (c) A point estimator p of P and of the mean and variance of Z.

   A new antidoping control is decided after a match between team A and team B. Team A made one replacement while team B made four. At the end of the game five among the 27 players who actually played the match were reported to antidoping control.

   • On the basis of the found P,

     (a) compute the probability that at least one player tests positive at the antidoping control;
     (b) compute on how many players the control must be done in order to find at least one positive with probability 0.900;
     (c) the antidoping control has been extended to the 27 players. Knowing that one of them has tested positive, compute the probability that he is on team A.

   • On the basis of the found confidence interval for P, compute analogous confidence intervals for the parameters questioned in the previous points.

   • On the basis of the distribution law of P, compute (intervals for) unbiased estimators of the above parameters.

8. A radioactive source contains $9 \cdot 10^{10}$ nuclei, each one with a probability $p = 10^{-10}$ of decaying per minute.

   (a) On average, how many decays per minute are we going to observe?
   (b) What is the probability of observing 9 decays?
   (c) What is the probability of observing a number of decays which differs by 3 or more from the expected value?

9. For security reasons, at the airport before boarding every passenger is subject to a brief body search. Suppose that the mean number of passengers independently arriving at the check point every minute is 5, then:

   (a) Identify a proper model for describing the number of arrivals.
   (b) Calculate the probability that only 1 passenger comes to the check point between 10:00 and 10:01 a.m.
   (c) The policeman is able to frisk at most 10 passengers per minute. What is the probability that in the first minute of the service one passenger is forced to stand in line?

10. Every morning at 7:00 a.m. an old lady leaves in her yard a bowl full of food for the stray cats living in the neighborhood. On average, 2 cats come independently every hour to see if there is food for them. What is the probability that the first cat comes by 8 a.m.?

11. The engines mounted on a fleet of airplanes have a constant failure rate of $10^{-6}$ Km$^{-1}$.


    (a) What is the probability that the airplanes equipped with only one engine reach New York from Rome (8000 Km) without crashing?
    (b) Calculate the same probability for an airplane equipped with two engines so shrewdly mounted that the failure of one engine does not affect the reliability of the other.

12. A small telephone company uses a voice recognition system for dispatching phone calls. The daily numbers collected in a given week are {80, 90, 70, 77, 75, 85, 83}. Describing these numbers through a Poisson random variable Z whose parameter $\mu$ accounts for the mean number of calls per day,

    (a) Provide a minimal and sufficient statistic for A.
    (b) Derive from a suitable twisting argument the cumulative distribution function of A.
    (c) Compute a point estimate of A.

    Of the received messages, only 70% are correctly recognized and consequently well served.

    (d) Describe the distribution law of the random variable Y counting the number of well served messages per day.
    (e) Compute a 0.900 confidence interval for the mean of this variable.

13. By using the definitions, prove that the expected value of a negative exponential distribution of parameter $\lambda$ is $\frac{1}{\lambda}$. Prove that the variance is $\frac{1}{\lambda^2}$.

14. A certain type of light bulb has a constant probability of burning out in a day equal to 0.001.

    (a) What is the probability of a burn-out within 2 days?
    (b) On average, after how many days does this happen?
    (c) And with which variance?

15. The town council decides to put one autovelox (a device which measures car velocity) on one of the city’s three main streets. In order to decide where to put the autovelox so as to punish the greatest number of infractions, police patrols A, B and C monitor the traffic on these three streets respectively for a specified period.

    Police patrol A, having all the time necessary during the day, decides to monitor the number of violations and finds an average of four per hour. Police patrol B, having only a few minutes per day to monitor the situation, decides to check the time between the use of the autovelox and the first violation, discovering an average of 15 minutes. Police patrol C, having more time available than patrol B, decides to monitor how long it takes the autovelox to catch the second violation: an average of 30 minutes. In the standard hypothesis of the Poisson model:


    (a) Which is the best street for putting the autovelox in order to discover the most violations?
    (b) In order to discover at least one violation per day with a 0.950 probability, how much time must the autovelox be active per day in that street?
    (c) On Sunday, a fair will be held and to reach it people must take the street checked by patrol C. 1000 cars are estimated to arrive from that route. Knowing that one car driver out of 25 is undisciplined, what is the probability that at least 100 of them will be stopped?

16. The height of the pines in a wood with 200 trees follows a normal distribution law with expected value 10.00 m and standard deviation equal to 0.90 m.

    (a) What is the probability that a tree randomly chosen measures less than 8.00 m in height?
    (b) What is the probability that among 10 trees randomly chosen there are at least two measuring less than 8.00 m?
    (c) What is the probability that in this wood there are 3 trees less than 8.00 m in height?

17. Let us consider a Gaussian random variable X with 4 as expected value and 3 as variance. Determine the probability that a specification of X is far from the expected value by no more than 0.4. Determine a lower bound to this probability in the case the distribution law is unknown.

18. A manufacturing plant produces ball bearings, which are assumed to be satisfactory if their diameter is less than or equal to 14.005 mm and defective otherwise. The diameter of those ball bearings follows a Gaussian distribution law: we know that at present, on average, 20% of the items made have a diameter less than 14.002 mm and 9% a diameter greater than 14.004 mm.

    (a) Calculate the expected value of the diameter.
    (b) Calculate the variance.
    (c) The chief engineer decides to proceed with a general overhaul of the plant if the percentage of defective pieces is greater than 0.1%. What is he going to decide?

19. Another factory produces water pipes whose length follows a Gaussian distribution with 1 m as the expected value. If the pipe length is greater than 0.01% the item is deemed defective and has to be rejected; on average, there is one rejected item out of every 100. What is the variance of the distribution?


20. A foodstuff canning company declares a mean packed weight $\mu = 800$ gr for a given product. The process is completely automatic and it ensures a standard deviation $\sigma_X = 5$ gr. Every morning the producer analyses 10 cans and decides to stop the production line if their total weight surpasses a fixed threshold. Denote by $x_i$ the weight of the i-th package, by $s = \sum_{i=1}^{10} x_i$ their total weight and by t the threshold.

    (a) For which value $t_{0.01}$ of the threshold is the probability of stopping the production less than 0.01 when the actually packed mean weight is 800 gr? Answer this question:

        i. without knowing the distribution law of X and basing your reasoning on the sole value of $\mu_X$ (hint: use the Markov inequality);
        ii. without knowing the distribution law of X and basing your reasoning on the sole values of $\mu_X$ and $\sigma_X$. Assume the distribution law of S to be symmetric around its mean value (hint: use the Chebyshev inequality);
        iii. assuming S to (approximately) follow a Gaussian distribution law.

    (b) Denote by $t_{0.01}$ the threshold found above in the last hypothesis, and assume X to be a Gaussian random variable. Compute the probability of not stopping the production when the population mean is equal to 820 gr.

    Monthly, the producer stops the production to recalibrate the plant. The machinery is equipped with a lever which modifies the mean by a quantity $\Delta\mu = 0.5\,\Delta\alpha$, where $\alpha$ is the angle between the lever and a line orthogonal to the earth (it is expressed in degrees: positive values correspond to shifts to the right, negative values to shifts to the left). On the basis of the X sample: {817, 814, 824, 817, 816, 823, 822, 816, 822, 820} and the assumption on X to be Gaussian:

    (c) Compute the distribution law of $M_X$ ($\mu_X$ is one of its specifications).
    (d) Estimate a value $\Delta\alpha$ of the angle of the lever suitable for restoring the right mean value of the canned weight.
    (e) Compute the value of $\Delta\alpha$ which ensures that $\mu_X$ is greater than 800 gr with probability equal to 0.900.

21. Let Y = log X where X is a negative exponential random variable. Find the density function of Y.

22. Let $Y = X^2$ and $f_X(x) = \frac{1}{2\theta}I_{(-\theta,\theta)}(x)$, $\theta > 0$. Compute the c.d.f. of X and Y and the related expected value. Compute the d.f. of Y.

23. Let X be a random variable with $\mu_X$ as the expected value and $\sigma_X^2$ as the variance. Let Y = 2X + 1 and $(X_1,X_2,\dots,X_n)$ a sample drawn from X. Propose a weakly unbiased estimator for the expected value of Y and calculate its variance as a function of $\sigma_X^2$.


24. In order to estimate the rate of inflation consequent to the introduction of the Euro currency, the price rise of a common consumer good is being measured in 10 supermarkets in a big city obtaining that the sum of all price rises is equal to 15 Euro and the sum of their squares is 20 Euro. Compute a weakly unbiased estimate of both the mean price rise and its variance.

25. Show that the recurrence relation (B.36) among $p_i$’s extends to the Poisson distribution as follows:

$$p_{i+1} = \frac{\mu}{i+1}\,p_i \tag{B.163}$$

26. Numerically verify how the empirical c.d.f. of a sample of random numbers drawn uniformly in [0,1] through Algorithm B.1 converges to the c.d.f. of the uniform distribution as the sample size increases.

27. Implement a random numbers generator drawing uniformly in [0,1], using the following recurrence relation instead of B.17


oe (ant,-1 +c) modn (B.164) 
n 


    where c is an integer number and the other parameters remain unchanged. Verify how the produced samples are indistinguishable from the ones generated through B.17.

28. In order to approximate $\pi$ (the length of a circumference of unitary diameter), consider the random experiment consisting in i) drawing a unitary edge square Q, ii) inscribing in it a circle C, iii) throwing m points in Q whose coordinates are uniform numbers in [0,1], labeling by 1 the points belonging to C and by 0 the remaining ones, and iv) computing the ratio between the sum of the attributed labels and m. How large should m be in order to have the approximation error starting from the k-th significant digit? (Hint: use the large numbers law (2.181)).

29. Consider an infinite sequence of random variables uniformly distributed in [0,1]. Let N be the random variable indicating the position of the first item in the sequence being bigger than its successor. Show that E[N] = e (the basis of natural logarithms) and use this fact to approximate e (Hint: if N > n only one of the n! permutations of uniform specifications is allowed in the first n elements of the sample).

30. The proposed algorithm for generating a binomial random variable B of parameters $n \in \mathbb{N}$ and $p \in \mathbb{R}$ requires on average np + 1 iterations, so that if $p > \frac{1}{2}$ it will be (on average) less costly to generate a binomial random variable B' of parameters n and 1 − p. Show that B and n − B' have the same distribution law.




31. Let $U_1,\dots,U_n$ be i.i.d. random variables uniformly distributed over [0,1]. Prove (for instance by mathematical induction) the following relation

$$f_{\sum_{i=1}^{n}U_i}(x) = \sum_{k=0}^{n-1}\frac{1}{(n-1)!}\left(\sum_{h=0}^{k}(-1)^h\binom{n}{h}(x-h)^{n-1}\right) I_{[k,k+1]}(x) \tag{B.165}$$

Numerically verify how $\sum_{i=1}^{n}U_i$ tends to a Gaussian random variable of mean $\frac{n}{2}$ and variance $\frac{n}{12}$ as $n \to +\infty$.


32. Prove that for any sample $\{X_1,\dots,X_m\}$ of any random variable X:
• the c.d.f. of $Z = \min\{X_1,\dots,X_m\}$ is $F_Z(z) = 1-(1-F_X(z))^m$;
• the c.d.f. of $Y = \max\{X_1,\dots,X_m\}$ is $F_Y(y) = F_X(y)^m$.

33. Prove that for any random variable X:
• the d.f. of Z = (X − a)/b is $f_Z(z) = f_X(a+bz)\,b$;
• the c.d.f. of the same Z is $F_Z(z) = F_X(a+bz)$.


34. Prove that for any sample $\{X_1,\dots,X_m\}$ of any random variable X with $E[X] = \mu$ and $V[X] = \sigma^2$, for big values of m:


• the distribution law of $M = \frac{1}{m}\sum_{i=1}^{m}X_i$ is approximable by a Gaussian variable Y with mean $E[M] = \mu$ and variance $V[M] = \sigma^2/m$;


• the distribution law of $S^2 = \frac{1}{m-1}\sum_{i=1}^{m}(X_i-M)^2$ is approximable by a Gaussian variable Y with mean $E[S^2] = \sigma^2$ and $V[S^2]$ depending on the shape of the X distribution law.


C — Markov Chains 


Definition C.1. Given a finite set $\mathcal{X} = \{x_1, \dots, x_n\}$, a discrete time Markov
chain is a pair of infinite sequences:

• $\Psi = (\psi(1), \dots, \psi(t), \dots)$ of state vectors $\psi(t)$, where each $\psi(t)$ is a
probability distribution $\{\psi_{x_1}(t), \dots, \psi_{x_n}(t)\}$ over $\mathcal{X}$, and

• $\mathcal{P} = (P(1), \dots, P(t), \dots)$ of transition matrices $P(t)$, where $P(t)$ is a
matrix whose ij-th element $P_{ij}(t)$ measures the conditional probability of
observing $x_j$ at time t + 1 once $x_i$ has been observed at time t.


Henceforth we will focus on Markov chains with P(t) = P for all t, which we
call homogeneous. Considering a grid with repeated columns sequentially
associated with a time stamp, where each cell in the column is a site denoted by an
$x_i$, you may visualize the chain as an infinite set of trajectories passing through
sites. To get an idea of this, you may think of the sites as the countries of a
state, the trajectories as the connecting roads, and the traffic intensity as a
measure of the probability that a car crosses a given country at a given time.
More schematically, you may identify the chain with the following experiment.
You have n urns each containing a certain number of balls, each one
labeled by the index of one of the urns (possibly the very urn containing the ball).
Let $N_{ij}$ be the number of balls labeled by j in urn i. The experiment consists
in a neverending sequence of extractions: at time t, on a given urn, you draw a
ball and move at time t + 1 to the urn whose index is marked on the ball and
repeat the extraction. Considering an infinite number of trajectories starting at
time 0 from any of the n urns according to the probability distribution $\psi(0)$,
you observe at time t trajectories crossing the sites with asymptotic frequencies
coinciding in probability with $\psi(t)$, where the asymptotic frequency of passing
from site i to site j tends to $P_{ij} = N_{ij} / \sum_{k=1}^{n} N_{ik}$.
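
The urn experiment lends itself to a direct simulation. The sketch below is only illustrative: the ball counts are hypothetical, every ball is assumed to be put back after being drawn, and the state vector is handled as a row vector whose evolution multiplies it by the matrix of entries N_ij / Σ_k N_ik. The frequencies observed over many trajectories are then compared with the state vector computed at the same time.

```python
import random


def transition_matrix(counts):
    """P[i][j] = N_ij / sum_k N_ik, from the ball counts of each urn."""
    return [[c / sum(row) for c in row] for row in counts]


def evolve(psi, P, steps):
    """Multiply the row vector psi by P for the given number of steps."""
    n = len(P)
    for _ in range(steps):
        psi = [sum(psi[j] * P[j][i] for j in range(n)) for i in range(n)]
    return psi


def simulate(counts, psi0, steps, runs, rng):
    """Frequencies of the urn reached at time `steps` over many trajectories."""
    urns = range(len(counts))
    visits = [0] * len(counts)
    for _ in range(runs):
        urn = rng.choices(urns, weights=psi0)[0]
        for _ in range(steps):
            # Draw a ball (with replacement) and move to the urn it names.
            urn = rng.choices(urns, weights=counts[urn])[0]
        visits[urn] += 1
    return [v / runs for v in visits]


counts = [[1, 3, 0], [2, 2, 4], [5, 0, 5]]  # hypothetical ball counts N_ij
psi0 = [1.0, 0.0, 0.0]  # every trajectory starts from the first urn
expected = evolve(psi0, transition_matrix(counts), steps=4)
observed = simulate(counts, psi0, steps=4, runs=20_000, rng=random.Random(0))
print(expected)
print(observed)
```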

Fact C.1. In a Markov chain:

• the state vector is a stochastic vector; i.e. all components are ≥ 0 and
their sum equals 1.

• the transition matrix is a stochastic matrix; i.e. its rows are stochastic
vectors.


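Fact C.2. Given the meaning of the transition matrix, the dynamics of the state
vector, regarded as a row vector, is the following:

$$\psi(t+1) = \psi(t) P \tag{C.1}$$

Hence

$$\psi_i(t+1) = \sum_{j=1}^{n} \psi_j(t) P_{ji} \tag{C.2}$$

$$\psi(t) = \psi(t-1) P = \psi(0) P^t \tag{C.3}$$

$$P^{r+t}_{ij} = \sum_{k=1}^{n} P^{r}_{ik} P^{t}_{kj} \quad \text{(Chapman-Kolmogorov equation)} \tag{C.4}$$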
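
These relations are easy to check numerically. The following sketch uses a hypothetical 3 × 3 stochastic matrix and plain nested lists; it compares the state vector obtained by iterating one step at a time with the one obtained through a matrix power, and verifies the Chapman-Kolmogorov equation in the form P^5 = P^2 P^3. All names are ours.

```python
def mat_mul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(P, t):
    """t-th power of P, starting from the identity matrix."""
    n = len(P)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, P)
    return result


def row_times(psi, P):
    """Row vector times matrix: component i is sum_j psi[j] * P[j][i]."""
    n = len(P)
    return [sum(psi[j] * P[j][i] for j in range(n)) for i in range(n)]


P = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]  # hypothetical matrix
psi0 = [0.5, 0.5, 0.0]

psi_iterated = psi0
for _ in range(5):
    psi_iterated = row_times(psi_iterated, P)  # one step at a time
psi_direct = row_times(psi0, mat_pow(P, 5))  # initial state times P^5

lhs = mat_pow(P, 5)
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))  # Chapman-Kolmogorov with r=2, t=3
print(psi_iterated, psi_direct)
```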


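Fact C.3. The flow of transitions across site i at a certain time t is ruled by
the following equation

$$\psi_i(t+1) - \psi_i(t) = \sum_{j=1}^{n} \left( \psi_j(t) P_{ji} - \psi_i(t) P_{ij} \right) \quad \text{(master equation)} \tag{C.5}$$

Check:

$$\psi_i(t+1) - \psi_i(t) = \sum_{j=1}^{n} \psi_j(t) P_{ji} - \psi_i(t) \sum_{j=1}^{n} P_{ij} \tag{C.6}$$

where the last sum equals 1, so that (C.6) reduces to (C.2).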


Remark C.1. The master equation states that a probability increment in site
i is due to the difference between the probabilities of entering and leaving
this site. A stationarity condition occurs when this difference is null in each
site. This means that $\psi(t)$ does not change from time t on,
given the dynamics (C.1) of the process. A sufficient condition for having this
difference vanishing is what we call the detailed balance, which occurs when

$$\psi_j(t) P_{ji} - \psi_i(t) P_{ij} = 0 \quad \forall i, j \tag{C.7}$$

Still given the dynamics (C.1), the evolution of the state vector of a Markov
chain is completely determined by its initial value $\psi(0)$ and the transition matrix
P. In particular, the conditions for having an asymptotic stationary state affect
only P. Namely:


Theorem C.1. If P is an n × n stochastic matrix such that:

• for at least one $i \in \{1, \dots, n\}$, $P_{ii} > 0$

• for each $h, k \in \{1, \dots, n\}$ there exists an m such that $P^{m}_{hk} > 0$

then any Markov chain having transition matrix P has one and only one
stationary state whose components are all greater than 0.


We will not prove the theorem. Rather we make the following remarks.

Remark C.2.

• The stationary state does not depend on the initial state $\psi(0)$; hence
it depends only on the transition matrix.

• The conditions of Theorem C.1 ensure that each site of $\mathcal{X}$ will be visited
by the chain with probability greater than 0 (absence of prejudice).
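
The remarks above can be observed on a small example. In the sketch below, which is only an illustration under our own choices, a hypothetical 3 × 3 matrix satisfying the hypotheses of Theorem C.1 is iterated from two different initial states until the state vector stops changing; the two limits coincide, and for this particular matrix the limit also satisfies the detailed balance condition.

```python
def step(psi, P):
    """One application of the dynamics: row vector psi times matrix P."""
    n = len(P)
    return [sum(psi[j] * P[j][i] for j in range(n)) for i in range(n)]


def stationary_state(P, psi0, tol=1e-13, max_steps=100_000):
    """Iterate the dynamics until the state vector changes by less than tol."""
    psi = psi0
    for _ in range(max_steps):
        nxt = step(psi, P)
        if max(abs(a - b) for a, b in zip(nxt, psi)) < tol:
            return nxt
        psi = nxt
    raise RuntimeError("no convergence within the step budget")


def detailed_balance_gap(psi, P):
    """Largest violation of psi_j P_ji - psi_i P_ij = 0 over all pairs i, j."""
    n = len(P)
    return max(
        abs(psi[j] * P[j][i] - psi[i] * P[i][j]) for i in range(n) for j in range(n)
    )


# Hypothetical matrix: positive diagonal entries, every site reachable.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
from_first = stationary_state(P, [1.0, 0.0, 0.0])
from_last = stationary_state(P, [0.0, 0.0, 1.0])
print(from_first, from_last, detailed_balance_gap(from_first, P))
```

Detailed balance is only a sufficient condition: a matrix may admit a stationary state without satisfying it, in which case the last value printed would not vanish.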


D — Computational complexity 


In this appendix we just outline a few definitions that let us formally identify 
the computational complexity of a numerical task with sole reference to the 
necessary computation time. Thus we will not mention the other important
resource constraining our numerical implementations, namely mass memory.
Computational complexity is a broad research and application field. For
starters, the reader can take a look at two classical books cited in the references 
[Garey and Johnson, 1978, Papadimitriou, 1994]. The definitions we give refer 
to any common general purpose computing device like the PC on our desk. We 
give a formal description of its properties by referring to a very essential yet 
equivalent device: the Deterministic Turing Machine. 


Definition D.1. A Deterministic Turing Machine (DTM) is a computing
machine that consists of a finite state control device (think of the CPU of your PC),
plus a read-write head operating on an infinite tape divided into squares indexed
by integers (your large, though not infinite, capacity hard disk) (see Fig. D.1).


Fig. D.1: Scheme of the Turing Machine (control unit, read/write head, tape).


Definition D.2. A program $\mathcal{A}$ for a DTM, generally speaking an algorithm, is
given by the following specifications:

1. A finite set T of tape symbols including a blank.

2. A finite set S of states, including a start-state $s_0$ and two halt states:
yes-state $s_y$ and no-state $s_n$.

3. A transition function $\delta : (S \setminus \{s_y, s_n\}) \times T \to S \times T \times \{-1, 0, +1\}$. If
we denote by $T^*$ the set of all finite strings from T, including the empty
string, the input (otherwise named instance of $\mathcal{A}$) is a string $x \in T^*$ whose
first element is in the square 1 of the tape and the rest are consecutively
stored on the right of it; all the squares either preceding or following x
contain the blank symbol.


A DTM computation runs in the following way. $\mathcal{A}$ starts with the control
device in the state $s_0$, and the read-write head scanning the tape at square 1. The
control device evolves according to $\delta$, which on the basis of the current tape
symbol and state determines a new state, the symbol to be printed and the
successive shift of the head to the left (−1), to the right (+1) or no move (0).
Each application of $\delta$ represents one step of the DTM. $\mathcal{A}$ with input x, which we
denote $\mathcal{A}(x)$, stops when the control device enters the state $s_y$ or $s_n$, which
respectively denote the string x as accepted or rejected by $\mathcal{A}$. In this sense we
say that $\mathcal{A}$ solves a decision problem.
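
The computation just described can be mimicked in a few lines. The following is a minimal sketch rather than a reference implementation: the tape is a dictionary from square indices to symbols, the transition function is a dictionary from (state, symbol) pairs to (state, symbol, move) triplets, and the sample program, which accepts the binary strings containing an even number of 1s, is a hypothetical example of ours. The state names and the blank symbol "_" are likewise our own conventions.

```python
BLANK = "_"


def run_dtm(delta, tape_input, start="s0", yes="sy", no="sn", max_steps=10_000):
    """Run a DTM program; return (accepted, number of steps)."""
    tape = {i + 1: ch for i, ch in enumerate(tape_input)}  # input from square 1
    state, head, steps = start, 1, 0
    while state not in (yes, no):
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        symbol = tape.get(head, BLANK)  # unwritten squares hold the blank
        state, written, move = delta[(state, symbol)]
        tape[head] = written
        head += move
        steps += 1
    return state == yes, steps


# Hypothetical program: s0 = even number of 1s read so far, s1 = odd number.
EVEN_ONES = {
    ("s0", "0"): ("s0", "0", +1),
    ("s0", "1"): ("s1", "1", +1),
    ("s0", BLANK): ("sy", BLANK, 0),
    ("s1", "0"): ("s1", "0", +1),
    ("s1", "1"): ("s0", "1", +1),
    ("s1", BLANK): ("sn", BLANK, 0),
}

print(run_dtm(EVEN_ONES, "1001"))
print(run_dtm(EVEN_ONES, "1011"))
```

On an input of length n this program always halts after n + 1 steps, since it reads each symbol once and then the first blank.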

We collect the set of instances of a computational problem into a language. 
Namely 


Definition D.3. For any finite set T of symbols, a language L is any subset of
$T^*$. If each element of L is accepted by $\mathcal{A}$, then we call it the language $L_{\mathcal{A}}$ and
say that it is recognized by $\mathcal{A}$.


Definition D.4. For a language L, and its related decision problem, the (time)
computational complexity of L is a function $t : \mathbb{N} \to \mathbb{N}$ such that:

$$t(n) = \min_{\mathcal{A} \text{ recognizing } L} \left\{ \max_{x \in T^n \cap L} \{\text{number of steps before } \mathcal{A}(x) \text{ halts}\} \right\} \tag{D.1}$$

For a DTM program $\mathcal{A}$ which stops for each input $x \in T^*$, its (time) complexity
$T(n)$ is defined as follows:

$$T(n) = \max_{x \in T^n} \{\text{number of steps before } \mathcal{A}(x) \text{ halts}\} \tag{D.2}$$


We distinguish between easy and difficult languages (and then problems)
according to their computational complexity.


Definition D.5. The class P is the class of decision problems L such that there
exists a polynomial complexity program $\mathcal{A}$ recognizing L. In short, you accept
or reject with $\mathcal{A}$ any $x \in T^n$ in a time polynomial in n.




We essentially identify with P and its search version the class of feasible 
problems. 


Definition D.6. A search problem is a concatenation of decision problems, each 
generating an additional bit of the solution. A search problem has a polynomial 
complexity if the length of its solution is polynomial and the complexities of the 
involved decision problems are polynomial too. 
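
A familiar instance of this concatenation is the search for a satisfying assignment of a Boolean formula, where each call to a decision procedure fixes one more bit of the solution. The sketch below only illustrates the mechanism: the decision procedure is a brute force stand-in of ours (hence not polynomial), clauses are lists of signed variable indices, and all names are hypothetical.

```python
from itertools import product


def satisfiable(clauses, n_vars, fixed):
    """Decision problem: can the partial assignment `fixed` be completed?"""
    free = [v for v in range(1, n_vars + 1) if v not in fixed]
    for values in product([False, True], repeat=len(free)):
        assignment = dict(fixed)
        assignment.update(zip(free, values))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False


def search(clauses, n_vars):
    """Build a solution one bit at a time through n_vars + 1 decision calls."""
    if not satisfiable(clauses, n_vars, {}):
        return None
    fixed = {}
    for v in range(1, n_vars + 1):
        fixed[v] = True
        if not satisfiable(clauses, n_vars, fixed):
            fixed[v] = False  # a completion exists, so it must use False here
    return fixed


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
formula = [[1, 2], [-1, 3], [-2, -3]]
print(search(formula, 3))
```

The number of decision calls is linear in the length of the solution, so the search inherits a polynomial complexity whenever the decision procedure has one.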


Besides these classes we have a wide spectrum of unfeasible problems whose
complexity ranges from a superpolynomial function up to any kind of unbearably
growing functions. A very common family of problems, including problems
that on the one hand occur in everyday life and on the other are open to getting
approximate solutions in feasible time, is rooted in the NP class. It is a class of
decision problems whose complexity is measured in the ideal computational
framework represented by a non deterministic Turing machine, i.e. a Turing
machine that performs many computations in parallel. This capability is described
by the following definition.


Definition D.7. A Non Deterministic Turing Machine (NDTM) is a DTM
endowed with more than one read-write head and a parallel transition function
$\delta : (S \setminus \{s_y, s_n\}) \times T \to 2^{S \times T \times \{-1, 0, +1\}}$.


Thus, with a tape character in input and a non terminal state, the function 
might generate more than one triplet (state, character to print, next move). This 
means that the read-write head splits in many replicas which print characters 
and move along the tape. 
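
On an ordinary computer the replicas can be simulated by keeping the list of all configurations alive at a given step. In the sketch below, which rests on conventions of our own, the transition function maps a (state, symbol) pair to a list of triplets, a missing entry means that the replica dies, and the input is accepted as soon as one replica reaches the yes-state. The sample program, accepting the binary strings that contain two consecutive 1s, is a hypothetical example.

```python
BLANK = "_"


def run_ndtm(delta, tape_input, start="s0", yes="sy", no="sn", max_steps=1000):
    """Simulate all replicas in parallel; return (accepted, parallel steps)."""
    first_tape = {i + 1: ch for i, ch in enumerate(tape_input)}
    frontier = [(start, 1, first_tape)]  # configurations: (state, head, tape)
    for step in range(max_steps + 1):
        if any(state == yes for state, _, _ in frontier):
            return True, step
        next_frontier = []
        for state, head, tape in frontier:
            if state == no:
                continue  # a rejecting replica simply stops
            symbol = tape.get(head, BLANK)
            for new_state, written, move in delta.get((state, symbol), ()):
                new_tape = dict(tape)  # each replica owns a copy of the tape
                new_tape[head] = written
                next_frontier.append((new_state, head + move, new_tape))
        if not next_frontier:
            return False, step
        frontier = next_frontier
    raise RuntimeError("step budget exhausted")


# Hypothetical program: in s0 a replica may guess that a pair "11" starts here.
CONTAINS_11 = {
    ("s0", "0"): [("s0", "0", +1)],
    ("s0", "1"): [("s0", "1", +1), ("s1", "1", +1)],
    ("s0", BLANK): [("sn", BLANK, 0)],
    ("s1", "1"): [("sy", "1", 0)],
    ("s1", BLANK): [("sn", BLANK, 0)],
}

print(run_ndtm(CONTAINS_11, "0110"))
print(run_ndtm(CONTAINS_11, "0101"))
```

The simulation pays for the parallelism with memory and time: the list of configurations may grow exponentially with the number of steps, which is exactly the cost that the ideal machine does not charge.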

On this machinery the NP class is an extension of the P class defined as follows:


Definition D.8. Class NP is the class of decision problems L having polynomial
complexity on a NDTM, i.e. such that for each L there exists a NDTM program
$\mathcal{A}$ accepting each $x \in L$ in a time polynomial in the length of x.


No one has shown that class P does not coincide with class NP (the famous
P ≠ NP dilemma), but neither has anyone proved the contrary. Namely:

• a special set of problems, the NP-complete class, have the property that
any other problem in NP can be quickly (that is, in polynomial time)
converted into each one of them. Thus if an algorithm exists solving an
NP-complete problem in polynomial time, then other algorithms must exist
quickly solving any problem in NP;

• no one has ever been able over the course of about half a century (since
the formalization of the class) to discover an algorithm solving an
NP-complete problem in polynomial time.




So we are used to conjecturing that such an efficient algorithm does not exist,
hence P ≠ NP.

The extension of the NP class to search problems gives rise to two (possibly 
coinciding) classes of problems. 


Definition D.9. A problem belongs to the NP-easy class if its solution requires 
the solution of a polynomial number of NP decision problems. 


Definition D.10. A problem belongs to the NP-hard class if the existence of
any algorithm solving it in polynomial time would imply the existence of an
algorithm solving an NP-complete problem in polynomial time too, hence P =
NP.


Remark D.1. In spite of the names, an NP-easy problem may be NP-hard too.


A typical way to overcome the complexity of NP problems is to introduce some 
random steps in the computation. While not ensuring a solution in feasible
time, this offers in many cases a good probability of finding one. This is due solely
to the fact that random steps break any possible stubbornly wrong calculation 
strategy. The extension of a DTM to a probabilistic Turing machine is the 
following. 


Definition D.11. A Probabilistic Turing Machine (PTM) is a Turing machine
whose transition function is $\delta : (S \setminus \{s_y, s_n\}) \times T \times R \to S \times T \times \{-1, 0, +1\}$,
where R is an infinite set of independent random bits having equal probability
of being set either to 0 or 1.


For each step, $\delta$ takes in input one of these bits in addition to the current
character and state for computing the triplet (next state, next character, next
move). Actually, $\delta$ may use or ignore the drawn bit, depending on the algorithm
implemented.

There is a very detailed taxonomy of the classes of problems solvable by this 
machinery, depending on the running time and correct solution probabilities. 
Here we just mention the RP class that is recalled in the book. 


Definition D.12. A problem $\pi$ belongs to the RP class if an algorithm $\mathcal{A}$ exists
running in polynomial time on a PTM such that, denoting by $L_\pi$ the related
language, $P(\mathcal{A}(x) \text{ outputs } s_y \mid x \in L_\pi) \geq \frac{1}{2}$ and
$P(\mathcal{A}(x) \text{ outputs } s_y \mid x \notin L_\pi) = 0$.
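
A classical example of this one-sided behavior, not discussed in the book and recalled here only as an illustration, is the recognition of composite numbers through one round of the Miller-Rabin test: a prime number is never declared composite, while for an odd composite number at least three quarters of the possible bases reveal its compositeness. The sketch below assumes Python's pseudorandom generator as a stand-in for the random bits of the PTM; the function names are ours.

```python
import random


def is_witness(a, n):
    """True if base a proves that the odd number n > 2 is composite."""
    d, s = n - 1, 0
    while d % 2 == 0:  # write n - 1 as d * 2**s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return False
    for _ in range(s - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return False
    return True


def composite_rp(n, rng):
    """One randomized run: answer yes (True) only if a witness is drawn."""
    if n < 4:
        return False  # 0, 1, 2 and 3 are not composite
    if n % 2 == 0:
        return True
    a = rng.randrange(2, n - 1)  # random base in [2, n - 2]
    return is_witness(a, n)


rng = random.Random(1)
print(sum(composite_rp(561, rng) for _ in range(1000)))  # yes answers on 561
print(any(composite_rp(7919, rng) for _ in range(1000)))  # 7919 is prime
```

Repeating the run k times and answering yes if at least one run does brings the probability of missing a composite number below 4^(-k), while the absence of false yes answers is preserved.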




DTM, function and algorithm are three different ways of denoting a univocal
map from input to output of our computer. DTM mostly evokes the
physical mechanism of the computation. Function, rather, defines the mapping
in mathematical terms, while algorithm describes it in terms of logical
instructions. Consider the set of all functions computable by a DTM. According to the
Church-Turing thesis [Kleene, 1967] they represent all functions computable by
any computing device. They are denoted as partial recursive functions (PRF)
as: i) they result from the iterative application of a limited set of primitive
functions (recursiveness), and ii) their output is not necessarily defined for
each input, which happens whenever your PC enters a neverending loop
(partiality). Now your PC may compute any one of these functions, provided you feed
it with a higher level input constituted by the computation program. Namely,
before supplying the input of the function you want to compute, you supply the
machine a set of instructions for simulating the DTM that computes exactly
this function. For this reason you speak of your PC, and of any DTM with these
properties, as a universal DTM. Namely:


Definition D.13. A universal DTM is a DTM able to compute any PRF, i.e. 
to simulate any other DTM. 


Obvious extensions of this universality notion exist to universal NDTM and
universal PTM. In particular our PC is a universal prefix machine, according
to Definition 5.1.


Bibliography 


[Aarts and Korst, 1989] Aarts, E. and Korst, J. (1989). Simulated annealing 
and Boltzmann machines: a stochastic approach to combinatorial optimiza- 
tion and neural computing. Wiley-Interscience series in discrete mathematics 
and optimization. John Wiley, Chichester. 


[Aho et al., 1974] Aho, A., Hopcroft, J., and Ullman, J. (1974). The Design and
Analysis of Algorithms. Addison-Wesley, Reading. 


[Altman, 2005] Altman, D. G. (2005). Why we need confidence intervals. World
Journal of Surgery. 


[Altman and Bland, 1995] Altman, D. G. and Bland, J. M. (1995). Absence of
evidence is not evidence of absence. British Medical Journal. 


[Amari, 1998] Amari, S. (1998). Natural gradient works efficiently in learning.
Neural Computation, 10:251-276. 


[Amari et al., 1992] Amari, S., Kurata, K., and Nagaoka, H. (1992).
Information geometry of boltzmann machines. IEEE Transactions on Neural
Networks, 3:260-271.


[Amato et al., 1991] Amato, S., Apolloni, B., Caporali, P., Madesani, U., and 
Zanaboni, A. M. (1991). Simulated annealing in back-propagation.
Neurocomputing, 3:207-220.


[Amit et al., 1985] Amit, D., Gutfreund, H., and Sompolinsky, H. (1985).
Spin-glass models of neural networks. Physical Review, A32:1007-1018.


[Angluin and Laird, 1988] Angluin, D. and Laird, P. D. (1988). Learning from
noisy examples. Machine Learning, 2(2):343-370. 


[Anthony and Bartlett, 1999] Anthony, M. and Bartlett, P. L. (1999).
Neural network learning: theoretical foundations. Cambridge University Press,
Cambridge.


[Apolloni, 1992] Apolloni, B. (1992). Design of algorithms for neural network 
supervised learning. Computers and Artificial Intelligence, 11(5):457-480.




[Apolloni, 1998] Apolloni, B. (1998). What size needs testing? In Marinaro, 
M. and Tagliaferri, R., editors, Neural Nets WIRN Vietri-97. Proceedings of 
the 9th Italian Workshop on Neural Nets, pages 115-123, London. Springer. 


[Apolloni et al., 1992] Apolloni, B., Armelloni, A., Bollani, G., and de Falco, 
D. (1992). Some experimental results on asymmetric boltzmann machines. 
In Garrido, M. S. and Vilela Mendes, R., editors, Complexity in Physics and 
Technology, pages 151-166. World Scientific, Singapore. 


[Apolloni et al., 2002a] Apolloni, B., Baraghini, F., and Palmas, G. (2002a). 
PAC meditation on boolean formulas. In Proceedings of SARA2002: Ab- 
straction, Reformulation and Approximation, pages 274-281. 


[Apolloni et al., 2003a] Apolloni, B., Bassis, S., Gaito, S., and Malchiodi, D. 
(2003a). Cooperative games in a stochastic environment. In Apolloni, B., 
Marinaro, M., and Tagliaferri, R., editors, Proceedings of WIRN2003: Italian 
Workshop on Neural Networks, Lecture Notes in Computer Science, Berlin. 
Springer. 


[Apolloni et al., 2005a] Apolloni, B., Bassis, S., Gaito, S., and Malchiodi, D. 
(2005a). Tight bounds for SVM classification error. In Zhao, M. and Shui, 
Z. E., editors, Proceedings - 2005 International Conference on Neural Network 
& Brain (ICNN&B’05), pages 5-8. IEEE Press. 


[Apolloni et al., 2006a] Apolloni, B., Bassis, S., Gaito, S., and Malchiodi, D. 
(2006a). Appreciation of medical treatments by learning underlying functions 
with good confidence. Current Pharmaceutical Design. to appear. 


[Apolloni et al., 2005b] Apolloni, B., Bassis, S., Gaito, S., Malchiodi, D., and 
Minora, A. (2005b). Computing confidence intervals for the risk of a SVM
classifier through algorithmic inference. In Apolloni, B., Marinaro, M., and 
Tagliaferri, R., editors, Biological and Artificial Intelligence Environments, 
pages 225-234. Springer. 


[Apolloni et al., 2002b] Apolloni, B., Bassis, S., Malchiodi, D., and Gaito, S. 
(2002b). Cooperative games in a stochastic environment. In Damiani, 
E., Howlett, R. J., Jain, L. C., and Ichalkaranje, N., editors, Knowledge- 
Based Intelligent Information Engineering Systems and Allied Technologies 
- KES 2002 (Proceedings of KES’2002: Sixth International Conference on 
Knowledge-Based Intelligent Information & Engineering Systems), pages 
296-300. 


[Apolloni et al., 1999] Apolloni, B., Battistini, E., and de Falco, D. (1999). 
Higher order boltzmann machines and entropy. Journal of Physics A. Math- 
ematical and General, 32:5529-5538. 


[Apolloni et al., 1991a] Apolloni, B., Bertoni, A., Campadelli, P., and de Falco, 
D. (1991a). Asymmetric boltzmann machines. Biological Cybernetics, 66:61- 
70. 




[Apolloni et al., 2005c] Apolloni, B., Brega, A., Malchiodi, D., Orovas, C., and 
Zanaboni, A. M. (2005c). A fuzzy method for learning simple boolean formu- 
las from examples. In Halgamuge, S. K. and Wang, L., editors, Computational 
Intelligence for Modelling and Prediction, Studies in Computational Intelli- 
gence, Vol. 2, chapter 26, pages 367-382. Springer. 


[Apolloni et al., 2003b] Apolloni, B., Brega, A., Malchiodi, D., Palmas, G., and 
Zanaboni, A. M. (2003b). Learning rule representations from boolean data. 
In Kaynak, O., Alpaydin, E., Oja, E., and Xu, L., editors, Artificial Neural 
Networks and Neural Information Processing - ICANN/ICONIP 2003 Joint
International Conference, Lecture Notes in Computer Science, pages 875-882, 
Berlin. Springer. 


[Apolloni et al., 2006b] Apolloni, B., Brega, A., Malchiodi, D., Palmas, G., and 
Zanaboni, A. M. (2006b). Learning rule representations from data. IEEE 
Transactions on System, Man and Cybernetics, Part A. to appear. 


[Apolloni and Chiaravalli, 1997] Apolloni, B. and Chiaravalli, S. (1997). Pac 
learning of concept classes through the boundaries of their items. Theoretical 
Computer Science, 172:91-120. 


[Apolloni and de Falco, 1990] Apolloni, B. and de Falco, D. (1990). Learning 
by feedforward boltzmann machine. In Novàk, M. and Pelikan, E., editors, 
Theoretical aspects of neural computing, pages 94-102. World Scientific. 


[Apolloni et al., 2004a] Apolloni, B., Esposito, A., Malchiodi, D., Orovas, C., 
Palmas, G., and Taylor, J. (2004a). A general framework for learning rules 
from data. IEEE Transactions on Neural Networks, 15(6). 


[Apolloni et al., 2004b] Apolloni, B., Esposito, A., Malchiodi, D., Orovas, C., 
Palmas, G., and Taylor, J. G. (2004b). A general framework for learning rules 
from data. IEEE Transactions on Neural Networks, 15:1333-1349. 


[Apolloni and Gentile, 1998] Apolloni, B. and Gentile, C. (1998). Sample size 
lower bounds in pac learning by algorithmic complexity theory. Theoretical
Computer Science, 209:141-162. 


[Apolloni and Gentile, 2000] Apolloni, B. and Gentile, C. (2000). p-sufficient 
statistics for pac learning k-term dnf formulas through enumeration. Theo- 
retical Computer Science, 230:1-37. 


[Apolloni and Kurfess, 2002] Apolloni, B. and Kurfess, F., editors (2002). From 
Synapses to Rules — Discovering Symbolic Rules from Neural Processed Data. 
Kluwer Academic/Plenum Publishers, New York. 


[Apolloni and Malchiodi, 2001] Apolloni, B. and Malchiodi, D. (2001). Gaining 
degrees of freedom in subsymbolic learning. Theoretical Computer Science, 
255:295-321. 




[Apolloni et al., 2001a] Apolloni, B., Malchiodi, D., Gaito, S., and Zoppis, I. 
(2001a). Twisting features with properties. In Marinaro, M. and Tagliaferri, 
R., editors, Proceedings of WIRN2001: Italian Workshop on Neural Networks, 
pages 301-312, Berlin. Springer. 


[Apolloni et al., 2002c] Apolloni, B., Malchiodi, D., Orovas, C., and Palmas, G. 
(2002c). From synapses to rules. Cognitive Systems Research, 3/2:167-201. 


[Apolloni et al., 2002d] Apolloni, B., Malchiodi, D., Orovas, C., and Zanaboni, 
A. M. (2002d). Fuzzy methods for simplifying a fuzzy formula inferred from 
examples. In Proceedings of the First International Conference on Fuzzy Sys- 
tems and Knowledge Discovery, volume 2, pages 554-558. 


[Apolloni et al., 2001b] Apolloni, B., Piccolboni, A., and Sozio, E. (2001b). A 
hybrid symbolic subsymbolic controller for complex dynamic systems. Neu- 
rocomputing, 37:127-163. 


[Apolloni et al., 1991b] Apolloni, B., Pisano, R., and de Falco, D. (1991b). The 
boltzmann machine: Experimental results and theoretical perspectives. In 
Probabilistic methods in mathematical physics, pages 58-69, Singapore. World 
Scientific. 


[Apolloni and Zoppis, 1999] Apolloni, B. and Zoppis, I. (1999). Subsymboli- 
cally managing pieces of symbolical functions for sorting. IEEE Transactions 
on Neural Networks, 10(5):1099-1122. 


[Aratyn and Rasinariu, 2005] Aratyn, H. and Rasinariu, C. (2005). A Short 
Course in Mathematical Methods with Maple. World Scientific Publishing 
Company. 


[Auer and Long, 1999] Auer, P. and Long, P. M. (1999). Structural results 
about on-line learning models with and without queries. Machine Learning, 
pages 147-181. 


[Baba and Jain, 2001] Baba, N. and Jain, L. C., editors (2001). Computational 
Intelligence in Games. Studies in Fuzziness and Soft Computing, Volume 62. 
Physica Verlag, Berlin. 


[Bartlett and Mendelson, 2002] Bartlett, P. L. and Mendelson, S. (2002). 
Rademacher and Gaussian complexities: Risk bounds and structural results. 
Journal of Machine Learning Research, 3:463-482. 


[Bartlett and Williamson, 1991] Bartlett, P. L. and Williamson, R. C. (1991). 
Investigating the distribution assumptions in the pac learning model. In Proc. 
4th Workshop on Computational Learning Theory, pages 24-32, San Mateo, 
CA. Morgan Kaufmann. 


[Baum and Haussler, 1989] Baum, E. B. and Haussler, D. (1989). What size 
net gives valid generalization? Neural Computation, 1:151-160. 




[Bellman, 1961] Bellman, R. (1961). Control Process: A Guided Tour. Prince- 
ton University Press, NJ. 


[Ben-David et al., 1997] Ben-David, S., Kushilevitz, E., and Mansour, Y. 
(1997). Online learning versus offline learning. Machine Learning, 29:45- 
63. 


[Benedek and Itai, 1988] Benedek, G. M. and Itai, A. (1988). Learnability by 
fixed distributions. In Proceedings of the first annual workshop on Computa- 
tional learning theory, pages 80-90. Morgan Kaufmann Publishers Inc. 


[Benedek and Itai, 1991] Benedek, G. M. and Itai, A. (1991). Learnability with 
respect to fixed distributions. Theoretical Computer Science, 86:377-389.


[Billingsley, 1995] Billingsley, P. (1995). Probability and Measure. Wiley Series
in Probability and Mathematical Statistics. John Wiley & Sons, New York, 
third edition. 


[Bishop, 1995] Bishop, C. M. (1995). Neural networks for pattern recognition.
Clarendon Press, Oxford. 


[Blum and Rivest, 1992] Blum, A. and Rivest, R. (1992). Training a 3-node
neural network is np-complete. Neural Networks, 5:117-127. 


[Blum and Blum, 1975] Blum, L. and Blum, M. (1975). Toward a mathematical
theory of inductive inference. Information and Control, 28(2):125-155. 


[Blumer et al., 1989] Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth,
M. (1989). Learnability and the vapnik-chervonenkis dimension. Journal of 
the ACM, 36:929-965. 


[Boracchi and Biganzoli, 2001] Boracchi, P. and Biganzoli, E. (2001). General- 
ized linear models for the hazard function of survival data. Biocybernetics 
and Biomedical Engineering, 21:39-54. 


[Breiman, 1986] Breiman, L. (1986). Bagging predictors. Machine Learning,
24(2):123-140. 


[Carpenter and Grossberg, 1988] Carpenter, G. and Grossberg, S. (1988). The 
art of adaptive pattern recognition by a self-organizing neural network. IEEE 
Computer, 21(3):77-88. 


[Chentsov, 1963] Chentsov, N. N. (1963). Evaluation of an unknown distribu- 
tion density from observations. Soviet Math. Docl., 4:1559-1562. 


[Chernoff, 1952] Chernoff, H. (1952). A measure of asymptotic efficiency for
tests of a hypothesis based on the sum of observations. Annals of Mathemat- 
ical Statistics, 23:493-509. 




[Clarckson et al., 1992] Clarckson, T. G., Gorse, D., Taylor, J. G., and Ng, 
C. K. (1992). Learning probabilistic ram nets using vlsi structures. IEEE
Transactions on Computers, 41(12):1552-1561. 


[Copernicus, 1543] Copernicus, N. (1543). De Revolutionibus Orbium
Coelestium, volume 1. 


[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector
networks. Machine Learning, 20:121-167. 


[Courant and Robbins, 1966] Courant, R. and Robbins, H. (1966). What is
Mathematics? Oxford University Press. 


[Cox, 1998] Cox, E. (1998). The fuzzy systems handbook. AP Professional, San
Diego. 


[Cramér, 1958] Cramér, H. (1958). Mathematical Methods of Statistics.
Princeton University Press, Princeton.


[Cristianini and Shawe-Taylor, 2000] Cristianini, N. and Shawe-Taylor, J.
(2000). An introduction to Support Vector Machines and other kernel-based 
learning methods. Cambridge University Press, Cambridge. 


[Cybenko, 1989] Cybenko, G. (1989). Approximation by superpositions of
sigmoidal functions. Mathematics of Control, Signals and Systems, 2:303-314.


[de Boor, 1978] de Boor, C. (1978). A Practical Guide to Splines.
Springer-Verlag, New York.


[De Finetti, 1975] De Finetti, B. (1975). Theory of Probability. Vol. 2: a Critical
Introductory Treatment. John Wiley & Sons, New York. 


[Devroye et al., 1996] Devroye, L., Györfi, L., and Lugosi, G. (1996). A
probabilistic theory of pattern recognition. Applications of mathematics. Stochastic
modelling and applied probability. Springer-Verlag, New York.


[Diffie and Hellman, 1976] Diffie, W. and Hellman, M. E. (1976). New
directions in cryptography. IEEE Trans. On Info. Theory, 22:644-654.


[Dorffner, 1997] Dorffner, G., editor (1997). Neural networks and a new
artificial intelligence. International Thomson Computing Press, London.


[Duda and Hart, 1973] Duda, R. O. and Hart, P. E. (1973). Pattern
classification and scene analysis. John Wiley & Sons, New York.


[Dynkin, 1951] Dynkin, E. B. (1951). Necessary and sufficient statistics for
a family of probability distributions. Uspekhi Matematicheskikh Nauk, 6(1
(41)):68-90. English Translation (published by the London Math. Soc. and 
the British Library) Russian Mathematical Surveys. 


[Efron, 1982] Efron, B. (1982). The Jackknife, the Bootstrap and Other
Resampling Plans. SIAM, Philadelphia.


[Efron and Tibshirani, 1993] Efron, B. and Tibshirani, R. (1993). An
introduction to the Bootstrap. Chapman and Hall, Freeman, New York.


[Erdélyi et al., 1981] Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi,
F. G. (1981). Definition of the g-function. In Higher Transcendental Func- 
tions, Vol. 1, pages 206-222. Krieger, New York. 


[Falconer, 1960] Falconer, K. (1960). Fractal Geometry: Mathematical
Foundations and Applications. John Wiley & Sons, New York.


[Fellenz et al., 2000] Fellenz, W. A., Taylor, G. J., Cowie, R., Douglas-Cowie,
E., Piat, F., Kollias, S., Orovas, C., and Apolloni, B. (2000). On emotion 
recognition of faces and speech using neural networks, fuzzy logic and the 
assess system. In Amari, S., Lee Giles, C., Gori, M., and Piuri, V., editors, 
Proceeding of the IEEE-INNS-ENNS International Joint Conference on
Neural Networks - IJCNN 2000, pages II.93-II.98, Los Alamitos. IEEE Computer
Society. 


[Feller, 1960] Feller, W. (1960). An Introduction to Probability Theory and Its 
Applications, volume 1. John Wiley & Sons, second edition. 


[Fischer, 2003] Fischer, I. (Accessed June 2003). Javanns - university
of Tübingen web site.
http://www-ra.informatik.uni-tuebingen.de/software/JavaNNS/welcome_e.html.


[Fisher, 1925] Fisher, M. A. (1925). On the mathematical foundations of
theoretical statistics. Philosophical Transactions of the Royal Society of London
Ser. A, 222:309-368. 


[Fisher, 1935] Fisher, M. A. (1935). The fiducial argument in statistical
inference. Annals of Eugenics, 6:391-398.


[Fishwick, 2003] Fishwick, P. (Accessed March 2003). Simpack toolkit.
http://www.cise.ufl.edu/~fishwick/simpack/simpack.html.


[Flammini et al., 1992] Flammini, M., Marchetti-Spaccamela, A., and Kucera,
L. (1992). Learning dnf formulas under classes of probability distributions.
In Proc. 5th Workshop on Computational Learning Theory, pages 85-92, San
Mateo, CA. Morgan Kaufmann.


[Florens et al., 1990] Florens, J. P., Mouchart, M. M., and Rolin, J. M. (1990).
Elements of bayesian statistics. Marcel Dekker, Inc., New York. 


[Fraser, 1957] Fraser, D. A. S. (1957). Nonparametric Methods in Statistics.
John Wiley, New York. 


[Fraser, 1958] Fraser, D. A. S. (1958). Statistics. An Introduction. John Wiley
& Sons, London. 




[Freireich et al., 1963] Freireich, E. J., Gehan, E., Schroeder, L. R., Wolman, 
I. J., Anbari, R., Burgert, E. O., Mills, S. D., Pinkel, D., Selawry, J. H., Moon, 
B. R. G., Spurr, C. L., Storrs, R., Haurani, F., Hoogstraten, B., and Lee, S. 
(1963). The effect of 6-mercaptopurine on the duration of steroid-induced
remissions in acute leukemia: a model for evaluation of other potentially
useful therapy. Blood, 21:699-716.


[Freund and R.E., 1995] Freund, Y. and Schapire, R. E. (1995). A decision-theoretic
generalization of on-line learning and an application to boosting. In Proc. II
European Conference on Computational Learning Theory, Barcelona, March
95.


[Fritzke, 1995] Fritzke, B. (1995). A growing neural gas network learns topolo- 
gies. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances 
in Neural Information Processing Systems, volume 7, pages 625-632, Cam- 
bridge, MA. MIT Press. 


[Garey and Johnson, 1978] Garey, M. R. and Johnson, D. S. (1978). Computers
and Intractability: a Guide to the Theory of NP-Completeness. W. H.
Freeman, San Francisco.


[Geisser, 1993] Geisser, S. (1993). Predictive Inference: An Introduction.
Chapman & Hall, New York.


[Geman et al., 1992] Geman, S., Bienenstock, E., and Doursat, R. (1992).
Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.


[Gersho and Gray, 1992] Gersho, A. and Gray, R. M. (1992). Vector
Quantization and Signal Processing. Kluwer, Boston, MA.


[Glivenko, 1933] Glivenko, V. (1933). Sulla determinazione empirica delle leggi
di probabilità. Giornale dell’Istituto Italiano degli Attuari, 3:92-99. 


[Gold, 1967] Gold, E. (1967). Language identification in the limit. Information
and Control, 10:447-474.


Goldman and Sloan, 1994] Goldman, S. A. and Sloan, R. H. (1994). The power 
of self-directed learning. Machine Learning, 14:271—-294. 


Goldreich, 1999] Goldreich, O. (1999). Modern cryptography, probabilistic 
proofs, and pseudorandomness. Algorithms and combinatorics. Springer, 
Berlin. 


[Gomez, 1999] Gomez, C., editor (1999). Engineering and Scientific Computing
with Scilab. Birkhäuser.


[Goodwin and Paine, 1977] Goodwin, G. C. and Paine, R. L. (1977). Dynamic 
system identification: experiment design and data analysis. Mathematics in 
science and engineering. Academic Press, New York. 




[Hanson and Pratt, 1989] Hanson, S. J. and Pratt, L. Y. (1989). Comparing
biases for minimal network construction with back-propagation. In Touretzky,
D. S., editor, Advances in Neural Information Processing Systems, volume 1,
pages 177-185.


[Haussler, 1989] Haussler, D. (1989). Generalizing the PAC model for neural
net and other learning applications. Technical Report UCSC-CRL-89-30, 
University of California, Santa Cruz. 


[Haykin, 1994] Haykin, S. (1994). Neural Networks - A Comprehensive Foundation.
Macmillan, New York.


[Hecht-Nielsen, 1990] Hecht-Nielsen, R. (1990). Neurocomputing. Addison-Wesley,
Reading, Mass.


[Helmbold et al., 2000] Helmbold, D. P., Littlestone, N., and Long, P. M.
(2000). On-line learning with linear loss constraints. Information and
Computation, 161:140-171.


[Hilario, 1997] Hilario, M. (1997). An overview of strategies for neurosymbolic 
integration. In Sun, R. and Alexandre, F., editors, Connectionist-Symbolic 
Processing, pages 13-35. Lawrence Erlbaum, Hillsdale, NJ. 


[Hinton and Sejnowski, 1987] Hinton, G. E. and Sejnowski, T. J. (1987). Learning
and relearning in Boltzmann machines. In Rumelhart, D. E., McClelland,
J. L., et al., editors, Parallel Distributed Processing: Volume 1: Foundations, 
pages 282-317. MIT Press, Cambridge. 


[Hochbaum, 1997] Hochbaum, D. S., editor (1997). Approximation algorithms
for NP-hard problems. PWS Publishing, Boston.


[Hyvarinen et al., 2001] Hyvärinen, A., Karhunen, J., and Oja, E. (2001).
Independent Component Analysis. John Wiley & Sons.


[Hyvärinen and Oja, 1998] Hyvärinen, A. and Oja, E. (1998). Independent
component analysis by general non-linear Hebbian-like learning rules. Signal
Processing, 64(3):301-313.


[Jacobs et al., 1991] Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, 
G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 
3(1):79-87. 


[Jolliffe, 1986] Jolliffe, I. T. (1986). Principal Component Analysis. Springer
Verlag.


[Juran and Gryna, 1988] Juran, J. and Gryna, F. M., editors (1988). Juran’s
Quality control handbook. McGraw Hill, 4th edition.


[Kearns and Li, 1988] Kearns, M. and Li, M. (1988). Learning in the presence
of malicious errors. In Proc. 20th annual ACM Symposium on Theory of
Computing, pages 267-280, New York. ACM Press.




[Kendall and Stuart, 1961] Kendall, M. G. and Stuart, A. (1961). The advanced 
theory of statistics, volume 2: inference and relationship. Charles Griffin, 
London. 


[Kleene, 1967] Kleene, S. C. (1967). Mathematical Logic. Wiley, New York.


[Knuth, 1997] Knuth, D. E. (1997). The art of computer programming. Volume
1: fundamental algorithms. Addison-Wesley, Reading, Mass.


[Kohonen, 1989] Kohonen, T. (1989). Self-Organization and Associative Memory.
Springer-Verlag, Berlin, 3rd edition.


[Kohonen, 2001] Kohonen, T. (2001). Self-Organizing Maps. Springer Series in
Information Sciences. Springer, Berlin, third edition.


[Kolmogorov, 1933] Kolmogorov, A. N. (1933). Grundbegriffe der
Wahrscheinlichkeitsrechnung. Ergeb. Math., 3.


[Kolmogorov, 1965] Kolmogorov, A. N. (1965). Three approaches to the quantitative
definition of information problems. Problems of Information Transmission,
1(1):1-7.


[Laplace, 1812] Laplace, P. (1812). Théorie Analytique des Probabilités.
Courcier, Paris.


[Laplace, 1814] Laplace, P. (1814). Essai philosophique sur les probabilités.
Veuve Courcier, Paris.


[Le Cun, 1985] Le Cun, Y. (1985). Une procédure d’apprentissage pour réseau
à seuil asymétrique. In Cognitiva 85: A la Frontière de l’Intelligence Artificielle
des Sciences de la Connaissance des Neurosciences, pages 599-604,
Paris. CESTA.


[Lehmer, 1964] Lehmer, D. H. (1964). The machine tools of combinatorics. In 
Beckenbach, E. F., editor, Applied Combinatorial Mathematics. Wiley, New 
York. 


[Levin et al., 1989] Levin, E., Tishby, N., and Solla, S. A. (1989). A statistical 
approach to learning and generalization in layered neural networks. In Rivest, 
R., Haussler, D., and Warmuth, M. K., editors, Proceedings of the Second An- 
nual Workshop on Computational Learning Theory, pages 245-260. Morgan 
Kaufmann. 


[Levin, 1974] Levin, L. A. (1974). Laws of information (nongrowth) and aspects 
of the foundation of probability theory. Problems of Information Transmis- 
sion, 10(3):206-210.


[Levin, 1987] Levin, L. A. (1987). One-way functions and pseudorandom gen- 
erators. Combinatorica, 4(7):357-363. 




[Lévy, 1979] Lévy, A. (1979). Basic set theory. Perspectives in mathematical
logic. Springer-Verlag, Berlin.


[Li and Vitanyi, 1997] Li, M. and Vitanyi, P. (1997). An Introduction to Kolmogorov
Complexity and its Applications. Springer, Berlin, 2nd edition.


[Lindstrom and Bates, 1990] Lindstrom, M. and Bates, D. (1990). Nonlinear
mixed effects models for repeated measures data. Biometrics, 46:673-687.


[Little, 2003] Little, M. C. (Accessed March 2003). C++sim home page.
http://cxxsim.ncl.ac.uk.


[Lo, 1988] Lo, A. (1988). A Bayesian bootstrap for a finite population. The
Annals of Statistics, 16(4).


[MacKay, 1992] MacKay, D. J. C. (1992). A practical Bayesian framework for
backpropagation networks. Neural Computation, 4(3):415-447.


[MacKay, 2003] MacKay, D. J. C. (2003). Information Theory, Inference, and
Learning Algorithms. Cambridge University Press, Cambridge.


[Mallat, 1998] Mallat, S. (1998). A Wavelet Tour of Signal Processing. AP
Professional, London.


[Martello and Toth, 1979] Martello, S. and Toth, P. (1979). The 0-1 knapsack
problem. In Combinatorial Optimization, pages 237-279. Wiley.


[Martin-Lof, 1966] Martin-Lof, P. (1966). The definition of random sequence.
Information and Control, 9:602-619.


[Marubini and Valsecchi, 1995] Marubini, E. and Valsecchi, M. G. (1995). Analyzing
survival data from clinical trials and observational studies. John Wiley
& Sons, Chichester.


[Mascheroni, 1790] Mascheroni, L. (1790). Adnotationes ad calculum integralem
Euleri. P. Galeazzi, Ticini.


[Medsker, 1995] Medsker, L. R. (1995). Hybrid intelligent systems. Kluwer
Academic Publishers, Boston.


[Meeker and Escobar, 1995] Meeker, W. and Escobar, L. A. (1995). Teaching
about approximate confidence regions based on maximum likelihood estimation.
The American Statistician, 49(1):48-53.


[Micheli et al., 1999] Micheli, A., Francisci, S., Krogh, V., Giorgi Rossi, A., 
Crosignani, P., and the ITAPREVAL Working Group (1999). Cancer prevalence
in Italian cancer registries areas: the ITAPREVAL study. Tumori, 85:309-
369. 


[Minsky and Papert, 1988] Minsky, M. and Papert, S. (1988). Perceptrons : an 
introduction to computational geometry. MIT Press, Cambridge, Mass. 




[Mitchell, 1997] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill series
in computer science. McGraw-Hill, New York.


[Mood et al., 1974] Mood, A. M., Graybill, F. A., and Boes, D. C. (1974).
Introduction to the Theory of Statistics. McGraw-Hill, New York.


[Morrison, 1967] Morrison, D. F. (1967). Multivariate Statistical Methods.
McGraw-Hill, New York.


[Nauck et al., 1997] Nauck, D., Klawonn, F., and Kruse, R. (1997). Foundations
of neuro-fuzzy systems. John Wiley, Chichester, New York.


[Nemhauser et al., 1989] Nemhauser, G. L., Rinnooy Kan, A. H. G., and Todd,
M. J., editors (1989). Optimization. Handbooks in Operations Research and
Management Science. North-Holland, Amsterdam.


[Neyman, 1935] Neyman, J. (1935). Su un teorema concernente le cosiddette 
statistiche sufficienti. Giornale dell’Istituto Italiano degli Attuari, 6:320-324. 


[Neyman and Pearson, 1933] Neyman, J. and Pearson, E. S. (1933). On the 
problem of the most efficient tests of statistical hypotheses. Transactions of 
the Royal Society of London, Series A, 231:289-337. 


[Oja, 1989] Oja, E. (1989). Neural networks, principal components, and sub- 
spaces. International Journal of Neural Systems, 1:61-68. 


[Paass, 1991] Paass, G. (1991). Integrating probabilistic rules into neural networks:
A stochastic EM learning algorithm. In UAI ’91: Proceedings of the
Seventh Annual Conference on Uncertainty in Artificial Intelligence, pages 
264-270. Morgan Kaufmann. 


[Papadimitriou, 1994] Papadimitriou, C. H. (1994). Computational complexity. 
Addison-Wesley, Reading, Massachusetts. 


[Patron-Bizet et al., 1998] Patron-Bizet, F., Mentre, F., Genton, M., Thomas- 
Haimez, C., and Maccario, J. (1998). Assessment of the global two-stage 
method to EC50 determination. Journal of Pharmacological and Toxicological
Methods, 39:103-108. 


[Pedrycz, 2001] Pedrycz, W. (2001). Granular computing in data mining. In 
Last, M. and Kandel, A., editors, Data Mining & Computational Intelligence, 
Physica-Verlag, Studies in Fuzziness and Soft Computing, Vol. 68. Springer- 
Verlag. 


[Percus, 1971] Percus, J. K. (1971). Combinatorial Methods. Springer, New 
York. 


[Peterson and Anderson, 1987] Peterson, C. and Anderson, J. R. (1987). A 
mean field theory learning algorithm for neural networks. Complex Systems, 
1:995-1019. 




[Piaget, 1969] Piaget, J. (1969). Biologie et connaissance. Gallimard, Paris.


[Pitt and Valiant, 1988] Pitt, L. and Valiant, L. (1988). Computational limitations
on learning from examples. Journal of the ACM, 4(35):965-984.


[Poggio and Girosi, 1990a] Poggio, T. and Girosi, F. (1990a). Networks for
approximation and learning. Proceedings of IEEE, 78:1481-1497.


[Poggio and Girosi, 1990b] Poggio, T. and Girosi, F. (1990b). Regularization
algorithms for learning that are equivalent to multilayer networks. Science,
247:978-982.


[pollution and mortality, 2005] pollution, A. and mortality, U. S. (2005).
SMSA dataset. http://lib.stat.cmu.edu/DASL/Datafiles/SMSA.html.


[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: programs for machine learning.
Morgan Kaufmann Publishers, San Mateo, California.


[Rao, 1949] Rao, C. R. (1949). Sufficient statistics and minimum variance
estimates. Proc. Camb. Phil. Soc., 45:213-218.


[Reed, 1993] Reed, R. (1993). Pruning algorithms - a survey. IEEE Transactions
on Neural Networks, 4(5):740-747.


[Ripley, 1987] Ripley, B. D. (1987). Stochastic Simulation. Wiley Series in
Probability and Mathematical Statistics. John Wiley & Sons, New York.


[Roger, 1967] Roger, H. (1967). Theory of recursive functions and effective
computability. McGraw-Hill.


[Rohatgi, 1976] Rohatgi, V. K. (1976). An Introduction to Probability Theory
and Mathematical Statistics. Wiley Series in Probability and Mathematical
Statistics. John Wiley & Sons, New York.


[Ross, 1987] Ross, S. M. (1987). Introduction to probability and statistics for en- 
gineers and scientists. Wiley series in probability and mathematical statistics. 
Wiley, New York. 


[Ross, 1997] Ross, S. M. (1997). Simulation. Statistical Modeling and Decision 
Science. Academic Press, San Diego, second edition.


[Rousseeuw and Hubert, 1999] Rousseeuw, P. and Hubert, M. (1999). Regres- 
sion depth (with discussion). Journal of American Statistical Association, 
94:388-433.


[Roy et al., 1997] Roy, A., Govil, S., and Miranda, R. (1997). A neural network 
learning theory and a polynomial time RBF algorithm. IEEE Transactions on
Neural Networks, 8(6):1301-1313. 




[Rudin, 1974] Rudin, W. (1974). Functional analysis. Tata McGraw-Hill, New
Delhi.


[Rumelhart, 1986] Rumelhart, D. E., editor (1986). Parallel Distributed Processing,
volume 1. MIT Press, Cambridge.


[Saad, 1998] Saad, D., editor (1998). On-Line Learning in Neural Networks.
Cambridge University Press.


[Scheffe and Tukey, 1945] Scheffe, H. and Tukey, J. W. (1945). Non-parametric
estimation, i. validation of order statistics. Annals of Mathematical Statistics,
16:187-192.


[Schölkopf et al., 1999] Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors
(1999). Advances in kernel methods : support vector learning. MIT Press, 
Cambridge, Mass. 


[Selman and Kautz, 1996] Selman, B. and Kautz, H. (1996). Knowledge compilation
and theory approximation. Journal of the ACM, 43(2):193-224.


[Sen and Srivastava, 1990] Sen, A. and Srivastava, M. (1990). Regression Analysis,
Theory, Methods and Applications. Springer-Verlag.


[Shoenfield, 1967] Shoenfield, J. R. (1967). Mathematical logic. Addison-Wesley,
Reading, Mass.


[Smirnov, 1961] Smirnov, V. I. (1961). Linear Algebra and Group Theory.
McGraw-Hill, New York.


[Solomonoff, 1964] Solomonoff, R. J. (1964). A formal theory of inductive inference:
part 2. Information and Computation, 7(2):224-254.


[Student, 1908] Student (1908). The probable error of a mean. Biometrika,
6(1):1-25.


[Sussmann, 1988] Sussmann, H. J. (1988). Learning algorithms for Boltzmann
machines. In Proceedings of the 27-th Conference on Decision and Control,
Austin.


[Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement 
learning: an introduction. Adaptive computation and Machine Learning. The 
MIT Press, Cambridge. 


[Thomas, 1995] Thomas, S. F. (1995). Fuzziness and Probability. ACG Press,
Wichita, USA.


[Tornay, 1938] Tornay, S. C. (1938). Ockham: Studies and Selections. Open
Court, La Salle, IL.


[Tukey, 1947] Tukey, J. W. (1947). Nonparametric estimation, ii. statistically
equivalent blocks and multivariate tolerance regions. the continuous case.
Annals of Mathematical Statistics, 18:529-539.




[Valiant, 1984] Valiant, L. G. (1984). A theory of the learnable. Communica- 
tions of the ACM, 11(27):1134-1142. 


[van Laarhoven and Aarts, 1987] van Laarhoven, P. J. M. and Aarts, E. H. L. 
(1987). Simulated annealing: theory and applications. Mathematics and its 
applications. D. Reidel, Dordrecht. 


[Vapnik, 1982] Vapnik, V. (1982). Estimation of dependencies based on empirical
data. Springer, New York.


[Vapnik, 1995] Vapnik, V. (1995). The Nature of Statistical Learning Theory.
Springer, New York.


[Vapnik, 1998] Vapnik, V. (1998). Statistical Learning Theory. John Wiley &
Sons, New York.


[Wegener, 1987] Wegener, I. (1987). The complexity of Boolean Functions.
Teubner and Wiley.


[Werbos, 1990] Werbos, P. (1990). Backpropagation through time: What it
does and how to do it. IEEE Proceedings, 78(10):1550-1560.


[Wermter and Sun, 2000] Wermter, S. and Sun, R., editors (2000). Hybrid neural
systems. Lecture notes in computer science 1778. Springer-Verlag, Berlin.


[Wilks, 1962] Wilks, S. S. (1962). Mathematical Statistics. Wiley Publications
in Statistics. John Wiley, New York.


[Williams and Zipser, 1989] Williams, R. J. and Zipser, D. (1989). A learning
algorithm for continually running fully recurrent neural networks. Neural
Computation, 1(2):270-280.


[Wolfram, 2003] Wolfram, S. (2003). The Mathematica book. Wolfram Media,
5th edition.


[Zadeh and Kacprzyk, 1992] Zadeh, L. A. and Kacprzyk, J., editors (1992).
Fuzzy logic for the management of uncertainty. Wiley, New York.


[Zubrzycki, 1972] Zubrzycki, S. (1972). Lectures in Probability Theory and
Mathematical Statistics. American Elsevier Publishing Company, Inc./PWN
- Polish Scientific Publishers, New York, Warszawa.


Index 


abstraction level, 188 
accuracy parameters, 146, 179-181, 
262, 288 
activation function, 245, 250, 269, 
297 
Heaviside, 246, 260 
linear, 246, 255 
probabilistic, 247, 283 
sigmoid, 247, 255, 267, 295 
activation mode, 245, 248 
asynchronous, 246, 250 
delayed, 246 
parallel, 246, 250, 279 
random, 246, 258, 263 
additive noise, 199, 214 
adequacy 
of sample size, 128-131 
test, 233, 238, 240 
algorithmic expansion, 147, 154,
288
algorithmic inference, 57, 58, 74,
159, 301


b_rectangles, see learning
back-propagation, 270, 272-275,
283, 295, 301
bagging, 300
batch, 172, 289, 290, 300
Bayes’ formula, 38, 258
Beta function, 346
incomplete, 99, 157
bias-variance problem, 165, 193,
285, 301
binomial coefficient, 14, 310
Boltzmann Machine, 250, 281, 290,
301
asymmetric, 250


by the way learning, 290, 292 
clamped mode, 280, 282 
intermediate belt, 251, 259, 281 
parallel, 279 
sequential, 282, 283 
thermal core, 250, 259, 281, 291 
unclamped mode, 279, 282 
visible belt, 251, 279, 291 
Boolean functions, 139, 145, 175, 
242, 258, 269, 287-289 
bootstrap, 67, 134, see p-bootstrap 
border 
inner, 187-189 
outer, 187-189 
Box-Muller method, 350 


c.d.f., see cumulative distribution 
function 
C4.5 decision tree, 295 
cardinality, 6 
Cauchy-Schwarz inequality, 372 
ceiling family, 205 
central limit theorem, 53, 134, 173, 
372 
chain rule, 39, 355 
Chapman-Kolmogorov equation,
382
Chebyshev inequality, 125, 351
Chernoff inequality, 175, 178, 193
Church-Turing thesis, 4, 252, 389
class
NP, 387
-complete, 387
-easy, 388
-hard, 138, 259, 262, 268, 388
P, 161, 386
RP, 388


finitely recoverable, 180 
learnable, 189 
well behaving, 168 
classification, 298 
clause, 162, 165, 190 
canonical, 188 
monotone, 151, 169, 185 
combination, 306, 310, 314 
with repetition, 312 
without repetition, 310, 314 
combinatorial calculus, 11, 55, 305 
assignment, 305 
compatible, 58, 59, 146 
complement set, 8 
computational complexity, 55, 138, 
162, 192, 237, 329, 385 
computational context, 4, 11, 17, 
50, 109 
concentration ellipses, 366 
concept, 140, 145 
almost consistent, 164, 168, 
178 
class, 145, 146 
consistent, 155 
confidence interval, 59, 96 
bilateral, 97 
extremes, 99 
feasible, 176 
for learning error, see learning 
level, 96 
specific examples, see distribu- 
tion 
statistical approach, 133 
symmetric, 99 
unilateral, 96, 97 
confidence region, 104, 108 
for regression curves, see re- 
gression 
configuration 
in combinatorial calculus, 305, 
307-311, 313 
in neural networks, 256 
with repetition, 305 
without repetition, 305 
congruential generator, 326 




Conjunctive Normal Form (CNF), 
see learning 
connection, 244, 246, 248-250, 258, 
262, 272, 283, 290-293, 
297 
connection weight, 245, 246, 248, 
250, 255, 267, 274, 280, 
290, 297, 299 
CONSISTENCY problem, 161, 190, 
253, 262 
consistency rule, 3 
consistent 
Boolean function, 146 
concept, see concept 
consistent estimator, see estimator 
contours of a partition, 76, 77, 82- 
84, 93 
convergence, 15 
in probability, 17, 129, 139 
convergence in mean square, 127 
convolution, 370 
cooling schedule, 266, 273 
correlation coefficient, 360, 361 
cost function, 257, 259, 263, 265, 
269, 279, 282 
covariance, 360, 362 
matrix, 366 
coverage, 103, 144, 216 
crosstalk, 297 
cumulative distribution function, 
25, 319 
conditional, 354, 363 
empirical, 33, 327 
geometric meaning, 322 
joint, 353, 357 
marginal, 354, 357, 363 
specific examples, see distribu- 
tion 
curse of dimensionality, 165 
curve distribution, 197, 200, 203, 
205, 211, 222, 224, 225, 
231 


d.f., see density function 
DAMPING procedure, 273 
De Morgan laws, 9 




debt, 95 
decision problem, 161, 386-388 
degrees of freedom 
of distribution, 132, 216 
of sample, 94, 95, 288 
of statistic, 95 
density function, 25, 321 
conditional, 363, 366 
joint, 363, 366 
marginal, 363 
specific examples, see distribu- 
tion 
detail 
inner, 184 
outer, 148, 149, 153, 164, 165, 
174, 175, 182, 185, 186, 
190, 193, 287 
relation with VC dimension, 
174, 185 
detailed balance, 250, 382 
Disjunctive Normal Form (DNF), 
see learning 
disposition, 306 
with repetition, 309 
without repetition, 308 
distribution, 24, see random vari- 
able 
g-bounded, 191 
a priori, 54, 73, 111, 258 
b_rectangle measure, 143 
Bernoulli, 21, 28, 33, 46, 62, 72, 
77, 99, 103, 110, 127-129, 
333, 338, 364, 371 
Beta, 54, 99, 143, 182, 233, 346 
binomial, 21, 29, 33, 52, 79, 87, 
89, 119, 121, 334, 337, 339, 
340, 371 
bounded measure domain, 87, 
90 
Chi square, 97, 102, 106, 108, 
111, 347 
conditional, 87, 354 
continuous uniform, 29, 33, 47, 
60, 64, 65, 68, 84, 90, 91, 
104, 112, 118, 118, 123, 
326, 342, 348, 368 


discrete uniform, 332, 337
Fisher, 238
Gamma, 52, 92, 345, 352, 370,
371
Gaussian, 42, 52-54, 89, 93,
101, 106, 108, 111, 128,
344, 349, 371, 372
Gaussian bidimensional, 364
geometric, 22, 49, 335, 337, 339
hypergeometric, 22, 70, 90,
103, 314, 359
marginal, 91
multinomial, 362
negative exponential, 36, 43,
48, 64, 66, 67, 89, 97-99,
103, 112, 123, 220, 340,
343, 348, 351, 371
Normal, 94, 102, 111, 128, 331,
369, 372
of the learning error, 155, 156,
164, 167, 169
Poisson, 41, 123, 336, 340, 341,
371
reproducible, 371
Student’s t, 107, 216, 347
symmetric exponential, 68, 85,
103
unknown, 117
Weibull, 222, 227
distribution dependent complexity,
166, 180


ensemble of networks, 258, 284, 301 
entropy, 119 
relative, 120, 258, 265, 279 
ε-closeness, 180
ε-cover, 180, 192
equivalent blocks, 233 
error probability, 146, 173, 177 
errors 
disingenuous, 165, 182, 191 
malicious, 165, 191 
estimator, 31, 60 
consistent, 124 
in mean square, 126, 127 
efficiency, 237, 238 




maximum likelihood, 121, 122, 
124, 197, 225, 232, 233, 
235, 331, 334 
method of moments, 117 
plug in, 122 
point, 59, 108, 113 
UMVUE, 116 
unbiased, 110 
weakly unbiased, 114, 124 
specific, see distribution 
event, 6 
algebra of, 6, 8, 9, 21, 22 
anomalous, 7, 8 
crowded, 43 
disjoint events, 20, 45, 46
elementary, 6-9, 14, 20, 23, 45,
312-314
impossible, 20, 24
joint occurrence, 37
rare, 43, 338 
sigma-algebra, 9 
sub-event, 8 
events 
intersection, 21, 46 
union, 9, 21 
evolution step, 249, 269, 279 
example, 146 
expectation operator, 324 
expected value, 27-31, 47, 323, 358, 
360, 372 
specific examples, see distribu- 
tion 
experiment, 6 
probabilistic, 19 
random, 17 
space, 6, 7, 58 
explaining function, 61, 75, 77, 80- 
82, 84, 87, 90, 113, 200, 
221, 229, 242, 251, 279, 
301 
exponential family, 89, 201 
multiparametric, 93 


factorial, 308 
factorization criterion, 88, 90, 91,
122, 201, 254
factorization principle, 305, 307 
family 
regular, 80 
separable, 80 
feature space, 6-7, 137 
formula
monotone, 171
frequency, 11, 12, 17, 23, 28, 31, 
34, 40, 50, 63, 64, 72, 73, 
96, 103, 104, 110, 117, 125, 
127, 142, 144, 156, 258, 
262, 326, 327, 381 


frontier 
inner, 184 
outer, 148 
Functions of random variable, 46 
c.d.f., 368 
d.f., 369 
expected value 
continuous, 31, 324, 364 
discrete, 30, 47, 116, 324, 358 
sample approximation, 31 
variance, 116 
linear combinations, 324, 325 
minimum/maximum, 380 
mixture, 118 
product, 360, 368, 372 
ratio, 368, 372 
sum, 52, 360, 370-372 
fuzzy sets, 55, 139, 239 


generalization, 246, 251, 263, 287, 
289 
generalized inverse, 60 
Glivenko-Cantelli theorem, 34 
gooseneck, 76, 82-83 
gradient, 270 
gradient descent, 237, 268, 270-272, 
279-283 
backward procedure, 270 
forward procedure, 270 
greedy algorithm, 169, 265 


H_large.asneeds learning procedure,
166
hazard function, 220 
confidence region, see regres- 
sion 
heuristic, 259 
histogram, 34 
hybrid system, 244, 276, 294-296 
hyperclause, 188 
hypermonomial, 188 
hypothesis, 140, 142, 146, 159-164, 
262, 287, 288 
computing consistent hypothe- 
ses, 162 


imprinting mechanism, 297 
incremental methods, 264 
independence, 16 
between events, 16, 37 
between random variables, 44, 
355, 360, 370 
independence of sample mean and 
sample variance, 95, 372 
independent component analysis, 
300 
indicator function, 34 
inductive bias, 138, 188 
inference, 58 
algorithmic, see algorithmic in- 
ference 
Bayesian, 73 
classical framework, 59, 114- 
193 
predictive, 55 
input nodes, 248 
invading concept, 148, 184 
inverse problem, 96, 172, 348 
inverse transform method, 48, 49, 
60, 328, 330 


Jacobian, 270 
Jensen inequality, 373 


k-means algorithm, 300 
kernel, 164 
Kolmogorov
axioms, 18, 20
complexity, 243, 251, 301 
framework, 18, 38, 55, 114, 119, 
128, 133, 165, 367 
Kullback distance, see entropy, rel- 
ative 


labeled 
points, 142 
sample, 146 
lack of memory, 352 
language, 386 
large numbers law, 127, 134, 138 
layer, 248, 255, 272 
learnability, 189 
learning, 59, 137 
active, 171 
algorithm, 139, 146, 154, 243, 
244 
b_rectangles, 142, 143, 151, 
160, 165, 167, 170, 180 
circles, 139, 151, 160, 186, 273 
Conjunctive Normal Form, 
162, 186, 190 
k-, 162, 190 
k-term, 162, 187 
Disjunctive Normal Form, 186 
k-term, 162, 171, 187, 190 
monotone, 171 
error 
confidence interval, 172, 175 
point estimator, 182 
variance, 183 
half lines, 152, 174 
hyperplane, 152, 163, 167, 169, 
182 
mirroring, 290, 291 
monotone clauses, 151 
non proper, 163 
on-line, 169, 192, 287, 289 
polygons, 152, 169, 186 
proper, 163 
segments, 152, 159 
subsymbolic, 138, 239, 244, 299 
symbolic, 138, 244 
unsupervised, 296, 301 
XOR, 266, 273, 283 


learning cycles, 268
learning rate, 272, 299
limit theorems, 52
list augmenting algorithm, 171
list pruning algorithm, 162
LocalStochasticOptimization
algorithm, 265
loss function, 109, 298
Lyapunov function, 259


marginalization, 38, 279 
Markov chain, 250, 259, 264, 282, 
381 
homogeneous, 381 
Markov inequality, 125 
master equation, 66, 67, 70, 90, 101, 
103, 382 
Mean Square Error, 110, 241, 242, 
257 
mean value, see expected value 
measurement error, 256 
median, 36, 69, 86, 103, 232, 235, 
236 
metaclause, 188 
metamonomial, 188 
method 
maximum likelihood, 120, 225, 
229, 235, 254 
of moments, 116 
peeling, 198, 224, 227, 231 
plug in, 122, 200, 207, 224 
Metropolis algorithm, 264, 267, 301 
Mexican hat, 299 
misunderstanding, 124 
mode, 36, 122, 197, 225, 232, 235 
model, 7 
Bernoulli, 21 
binomial, 21, 313 
equiprobable, 19 
Gaussian, 42 
hypergeometric, 22, 314 
linear, 121, 195, 324 
Poisson, 41 
Weibull, 222 
moment 
central, 325 




generating function, 370 
of a random variable, 31, 325 
of a sample, 32 
vs. moment generating func- 
tion, 371 
monomial, 162 
canonical, 187 
monotone, 182, 185 
MSE, see Mean Square Error 
multinomial coefficient, 362 


nested regions, 197, 207-211, 224, 
227, 231 
net input, 246 
neural network, 243, 244, 257, 258, 
300 
approximability, 255 
discretized output, 269 
feed-forward, 248, 255, 272, 
299 
recurrent, 248, 269-271, 276, 
278 
unfolded, 249 
neuro-control, 276-278 
neuron, 246, 262, 269 
Newton formula, 29, 310 
NN, see neural network 


Occam razor, 257
one way function, 51, 56, 57, 161
optimal margin, 163, 167, 182
optimization problem, 109, 163,
166, 259, 264
Oracle, 171
order statistic, 54, 85, 113, 233
outcome space, 7
output node, 248
overfitting, 256, 257, 288


P vs. NP dilemma, 387
p-bootstrap, 67, 74, 90, 103, 124 
of a Bernoulli variable, 66 
of a continuous uniform vari- 
able, 67 
of a hypergeometric variable, 
70 




of a negative exponential vari- 
able, 66, 67 
of a symmetric exponential 
variable, 68 
of regression curves, 224 
of regression parameters, 223, 
225, 231 
parameter distribution, 63, 64, 67, 
71, 81, 95, 103, 113 
of a Bernoulli variable, 72, 99, 
333 
of a binomial variable, 334 
of a continuous uniform vari- 
able, 91, 104, 342 
of a discrete uniform variable, 
332 
of a Gamma variable, 92, 345 
of a Gaussian variable, 101, 
106, 344 
of a geometric variable, 335 
of a hypergeometric variable, 
103 
of a negative exponential vari- 
able, 73, 97, 343 
of a Poisson variable, 336 
of a symmetric exponential 
variable, 85, 103 
partial recursive function, 192, 252, 
389 
partition induced by a statistic, 77, 
78, 82 
path’s graph, 82 
PE, see processing element 
PerceLearning algorithm, 260, 
262, 263, 287 
percentile, 322 
perceptron, 260 
multilayer, 266, 272 
permutation, 11, 306, 309, 314 
pivot, 74, 86, 87, 197, 206, 211, 227, 
232, 259 
pivotal quantity, 102, 128, 129 
pledge points, 95, 216, 242 
population, 31, 58 
power set, 9 
prefix complexity 




conditional, 252 
unconditional, 252 
vs. probability, 253 
prefix function, 252 
principal component analysis, 300 
probabilistic experiment, 19 
probability, 11 
atoms, 76, 80, 82 
axioms, 18 
Bayesian, 54 
combinatorial, 10, 313 
conditional, 37 
fiducial, 54 
function, 24, 321 
conditional, 354 
joint, 353, 356 
marginal, 354 
specific examples, see distri- 
bution 
mass, 24 
measure, 18 
model, 19 
space, 18 
probability integral transformation 
theorem, 47 
Probably Approximately Correct 
learning model, 145, 175 
processing element, 244, 246, 279 
subsymbolic, 244, 250 
symbolic, 244 
pruning, 256 
pseudogaussian norm, 299 
PTM, see Turing Machine 
punishment in training, 263 


quantile, 58, 96, 322 


radial basis function, 299 
random function, 147, 196, 224, 239 
random variable, 23, 319
aggregate of, 44
continuous, 24, 320
discrete, 24, 320
function of, 47, 367
generator of, 47, 49-51, 56,
326-328, 330, 337, 339




joint, 46, 352 
mixed, 27, 320, 322 
mixture, 118 
regular, 80 
scale change, 380 
specification, 23 
uncorrelated, 360 
regression, 195 
confidence region, 196-198 
for hazard function, 220 
with exponential error, 217 
dependent variable, 211 
line, 199 
linear, 198 
multidimensional, 218 
non linear, 219, 230 
normal equations, 235 
point estimators, 234 
sufficient statistics, 202 
with Gaussian error 
line, 367 
regularity conditions, 77, 80, 242 
regularization term, 258 
reinforcement learning, 264 
reliability, 123, 348 
reward in training, 263 
risk 
actual, 173 
function, 109, 241, 242 
rule of succession, 111 


sample, 31, 58, 60, 326 
cause/effect, 196 
labeled, 146 
likelihood, 88, 120 
minimum/maximum of, 380 
moments, 32 
separable, 163 
set, 61 
shattered, 173, 174 
size, 127 
space, 61, 320 
specification, 60 
support, 61 
sample complexity, 177, 180, 190, 
192, 193 




distribution dependent, 180- 
182 
lower bound, 182 
upper bound, 180 
distribution free, 177-180
lower bound, 179 
upper bound, 179 
sample mean, 331, 380 
expected value, 373 
variance, 373 
sample variance, 373, 380 
sampling 
Bernoulli variable, 62, 333 
Beta variable, 346 
binomial variable, 334, 339 
continuous uniform variable, 
65, 342 
discrete uniform variable, 332 
Gamma variable, 345 
Gaussian variable, 94, 344, 350 
geometric variable, 335, 338 
hypergeometric variable, 70 
mechanism, 61 
negative exponential variable, 
66, 343 
Poisson variable, 336 
symmetric exponential vari- 
able, 69 
variants, 61, 94 
Weibull variable, 221, 222 
with replacement, 22, 313 
without replacement, 22, 313 
search problem, 268, 387 
self-associative memories, 296 
capacity, 297 
self-organizing maps, 299 
self-organizing memories, 298 
sentinel, see sentry point 
sentry function 
inner, 183 
outer, 148 
sentry point, 148, 167, 168, 183, 287 
separator hyperplane, 260 
shattered set, 173 
signature, 157, 169 
Simulated Annealing, 266 




simulation of pseudorandom num- 
bers, 51 
soft-margin classifier, 166, 168, 169 
square error, see MSE 
standard deviation, 42, 326 
state vector, 245, 381 
stationary state, 248, 383 
statistic, 60 
r-bounded increment (r-bi), 
81, 83 
joint sufficient, 88, 93, 122 
minimal sufficient, 77, 110 
specific examples, see distribu- 
tion 
sufficient, 75, 77, 87, 90, 110, 
122 
well behaving, 65, 67, 69, 70, 
75, 85, 123, 134, 227, 234 
statistical approach, 133 
stochastic 
matrix, 382 
vector, 381 
strongly surjective, 155 
fairly, 155 
structural risk minimization, 138, 165, 193, 254
Student’s t, 351 
sufficiency rule, 75-79 
support, 146 
support vector, 163, 167-169, 176, 287
Support Vector Machines, 161, 163 
SVM, see Support Vector Machines 
symmetric difference, 137, 146, 153-157, 164, 167, 185, 192, 287, 316
symmetry ensemble, 4, 17 
augmented, 11, 316 
fully expanded, 11, 16 
symmetry rule, 3 


target concept, 139
test of hypothesis, 115, 133, 192, 238
test set, 243, 246, 256, 287-290
threshold in neural networks
net input, 246 
total probabilities theorem, 38 
training algorithm, 246, 251, 256, 267, 280, 287-290
training set, 246, 251, 256, 258, 267, 268, 287, 290, 291
transition matrix, 259, 279, 280, 381-383
traveling salesman problem, 169 
truncation parameter, 90 
Turing Machine, 248 
Deterministic, 253, 295, 385 
Non Deterministic, 387 
Probabilistic, 388 
program, 385 
universal, 389 
twisting argument, 71, 74, 124 
for b_rectangles, 143 
for Bernoulli distribution, 71, 333
for binomial distribution, 334
for continuous uniform distribution, 91, 342
for discrete uniform distribution, 332
for Gamma distribution, 92, 345
for Gaussian distribution, 94, 344
for geometric distribution, 335 
for linear regression 
line, 204-206 
parameters, 202 
for negative exponential distribution, 72, 343
for Poisson distribution, 336
for probabilistic neural network, 259
for symmetric exponential distribution, 86
for Weibull parameters, 223, 227
learning, 154 


uniformity rule, 3 


validation set, 246 
Vapnik-Chervonenkis 
dimension, 149, 173-175, 179, 192, 193, 287
variance, 372 
of a random variable, 32, 325, 360, 361, 372
of a sample, see sample variance
specific examples, see distribution
VC dimension, see Vapnik-Chervonenkis dimension
Venn diagram, 45 
virtual equiprobability, 75, 79, 83, 93


wavelets, 239
weight vector, 245, 257
witness point, 142, 143, 145, 154, 156, 157, 287, 289